Commit Graph

1571 Commits

Author SHA1 Message Date
Djalal Harouni
bc452b4b65 proc: do not allow negative offsets on /proc/<pid>/environ
__mem_open() which is called by both /proc/<pid>/environ and
/proc/<pid>/mem ->open() handlers will allow the use of negative offsets.
/proc/<pid>/mem has negative offsets but not /proc/<pid>/environ.

Clean this by moving the 'force FMODE_UNSIGNED_OFFSET flag' to mem_open()
to allow negative offsets only on /proc/<pid>/mem.

Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Djalal Harouni
e8905ec27e proc: environ_read() make sure offset points to environment address range
Currently the following offset and environment address range check in
environ_read() of /proc/<pid>/environ is buggy:

  int this_len = mm->env_end - (mm->env_start + src);
  if (this_len <= 0)
    break;

Large or negative offsets on /proc/<pid>/environ converted to 'unsigned
long' may pass this check since '(mm->env_start + src)' can overflow and
'this_len' will be positive.

This can turn /proc/<pid>/environ to act like /proc/<pid>/mem since
(mm->env_start + src) will point and read from another VMA.

There are two fixes here plus some code cleaning:

1) Fix the overflow by checking if the offset that was converted to
   unsigned long will always point to the [mm->env_start, mm->env_end]
   address range.

2) Remove the truncation that was made to the result of the check,
   storing the result in 'int this_len' will alter its value and we can
   not depend on it.

For kernels that have commit b409e578d ("proc: clean up
/proc/<pid>/environ handling") which adds the appropriate ptrace check and
saves the 'mm' at ->open() time, this is not a security issue.

This patch is taken from the grsecurity patch since it was just made
available.

Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Linus Torvalds
83c7f72259 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
 "Notable highlights:

   - iommu improvements from Anton removing the per-iommu global lock in
     favor of dividing the DMA space into pools, each with its own lock,
     and hashed on the CPU number.  Along with making the locking more
     fine grained, this gives significant improvements in multiqueue
     networking scalability.

   - Still from Anton, we know provide a vdso based variant of getcpu
     which makes sched_getcpu with the appropriate glibc patch something
     like 18 times faster.

   - More anton goodness (he's been busy !) in other areas such as a
     faster __clear_user and copy_page on P7, various perf fixes to
     improve sampling quality, etc...

   - One more step toward removing legacy i2c interfaces by using new
     device-tree based probing of platform devices for the AOA audio
     drivers

   - A nice series of patches from Michael Neuling that helps avoiding
     confusion between register numbers and litterals in assembly code,
     trying to enforce the use of "%rN" register names in gas rather
     than plain numbers.

   - A pile of FSL updates

   - The usual bunch of small fixes, cleanups etc...

  You may spot a change to drivers/char/mem.  The patch got no comment
  or ack from outside, it's a trivial patch to allow the architecture to
  skip creating /dev/port, which we use to disable it on ppc64 that
  don't have a legacy brige.  On those, IO ports 0...64K are not mapped
  in kernel space at all, so accesses to /dev/port cause oopses (and
  yes, distros -still- ship userspace that bangs hard coded ports such
  as kbdrate)."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
  powerpc/mpic: Create a revmap with enough entries for IPIs and timers
  Remove stale .rej file
  powerpc/iommu: Fix iommu pool initialization
  powerpc/eeh: Check handle_eeh_events() return value
  powerpc/85xx: Add phy nodes in SGMII mode for MPC8536/44/72DS & P2020DS
  powerpc/e500: add paravirt QEMU platform
  powerpc/mpc85xx_ds: convert to unified PCI init
  powerpc/fsl-pci: get PCI init out of board files
  powerpc/85xx: Update corenet64_smp_defconfig
  powerpc/85xx: Update corenet32_smp_defconfig
  powerpc/85xx: Rename P1021RDB-PC device trees to be consistent
  powerpc/watchdog: move booke watchdog param related code to setup-common.c
  sound/aoa: Adapt to new i2c probing scheme
  i2c/powermac: Improve detection of devices from device-tree
  powerpc: Disable /dev/port interface on systems without an ISA bridge
  of: Improve prom_update_property() function
  powerpc: Add "memory" attribute for mfmsr()
  powerpc/ftrace: Fix assembly trampoline register usage
  powerpc/hw_breakpoints: Fix incorrect pointer access
  powerpc: Put the gpr save/restore functions in their own section
  ...
2012-07-23 18:54:23 -07:00
David Howells
9249e17fe0 VFS: Pass mount flags to sget()
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called.  They could also be passed to the
compare function.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:38:34 +04:00
Christoph Hellwig
b5fb63c183 fs: add nd_jump_link
Add a helper that abstracts out the jump to an already parsed struct path
from ->follow_link operation from procfs.  Not only does this clean up
the code by moving the two sides of this game into a single helper, but
it also prepares for making struct nameidata private to namei.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:35:40 +04:00
Christoph Hellwig
408ef013cc fs: move path_put on failure out of ->follow_link
Currently the non-nd_set_link based versions of ->follow_link are expected
to do a path_put(&nd->path) on failure.  This calling convention is unexpected,
undocumented and doesn't match what the nd_set_link-based instances do.

Move the path_put out of the only non-nd_set_link based ->follow_link
instance into the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:35:35 +04:00
Al Viro
00cd8dd3bf stop passing nameidata to ->lookup()
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:32 +04:00
Al Viro
0b728e1911 stop passing nameidata * to ->d_revalidate()
Just the lookup flags.  Die, bastard, die...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:14 +04:00
Dong Aisheng
475d009429 of: Improve prom_update_property() function
prom_update_property() currently fails if the property doesn't
actually exist yet which isn't what we want. Change to add-or-update
instead of update-only, then we can remove a lot duplicated lines.

Suggested-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Dong Aisheng <dong.aisheng@linaro.org>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-11 15:26:51 +10:00
Linus Torvalds
0640113be2 vfs: Fix /proc/<tid>/fdinfo/<fd> file handling
Cyrill Gorcunov reports that I broke the fdinfo files with commit
30a08bf2d3 ("proc: move fd symlink i_mode calculations into
tid_fd_revalidate()"), and he's quite right.

The tid_fd_revalidate() function is not just used for the <tid>/fd
symlinks, it's also used for the <tid>/fdinfo/<fd> files, and the
permission model for those are different.

So do the dynamic symlink permission handling just for symlinks, making
the fdinfo files once more appear as the proper regular files they are.

Of course, Al Viro argued (probably correctly) that we shouldn't do the
symlink permission games at all, and make the symlinks always just be
the normal 'lrwxrwxrwx'.  That would have avoided this issue too, but
since somebody noticed that the permissions had changed (which was the
reason for that original commit 30a08bf2d3 in the first place), people
do apparently use this feature.

[ Basically, you can use the symlink permission data as a cheap "fdinfo"
  replacement, since you see whether the file is open for reading and/or
  writing by just looking at st_mode of the symlink.  So the feature
  does make sense, even if the pain it has caused means we probably
  shouldn't have done it to begin with. ]

Reported-and-tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-04 11:00:45 -07:00
Cyrill Gorcunov
5b172087f9 c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
We would like to have an ability to restore command line arguments and
program environment pointers but first we need to obtain them somehow.
Thus we put these values into /proc/$pid/stat.  The exit_code is needed to
restore zombie tasks.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:32 -07:00
Cyrill Gorcunov
818411616b fs, proc: introduce /proc/<pid>/task/<tid>/children entry
When we do checkpoint of a task we need to know the list of children the
task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).

So instead of walking over all pids in the system (creating one big
process tree in memory, just to figure out which children a task has) --
we add explicit /proc/<pid>/task/<tid>/children entry, because the kernel
already has this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:32 -07:00
Konstantin Khlebnikov
bca1554373 proc/smaps: show amount of nonlinear ptes in vma
Currently, nonlinear mappings can not be distinguished from ordinary
mappings.  This patch adds into /proc/pid/smaps line "Nonlinear: <size>
kB", where size is amount of nonlinear ptes in vma, this line appears only
if VM_NONLINEAR is set.  This information may be useful not only for
checkpoint/restore project.

Requested by Pavel Emelyanov.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Konstantin Khlebnikov
b1d4d9e0cb proc/smaps: carefully handle migration entries
Currently smaps reports migration entries as "swap", as result "swap" can
appears in shared mapping.

This patch converts migration entries into pages and handles them as usual.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Konstantin Khlebnikov
052fb0d635 proc: report file/anon bit in /proc/pid/pagemap
This is an implementation of Andrew's proposal to extend the pagemap file
bits to report what is missing about tasks' working set.

The problem with the working set detection is multilateral.  In the criu
(checkpoint/restore) project we dump the tasks' memory into image files
and to do it properly we need to detect which pages inside mappings are
really in use.  The mincore syscall I though could help with this did not.
 First, it doesn't report swapped pages, thus we cannot find out which
parts of anonymous mappings to dump.  Next, it does report pages from page
cache as present even if they are not mapped, and it doesn't make that has
not been cow-ed.

Note, that issue with swap pages is critical -- we must dump swap pages to
image file.  But the issues with file pages are optimization -- we can
take all file pages to image, this would be correct, but if we know that a
page is not mapped or not cow-ed, we can remove them from dump file.  The
dump would still be self-consistent, though significantly smaller in size
(up to 10 times smaller on real apps).

Andrew noticed, that the proc pagemap file solved 2 of 3 above issues --
it reports whether a page is present or swapped and it doesn't report not
mapped page cache pages.  But, it doesn't distinguish cow-ed file pages
from not cow-ed.

I would like to make the last unused bit in this file to report whether the
page mapped into respective pte is PageAnon or not.

[comment stolen from Pavel Emelyanov's v1 patch]

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Jan Engelhardt
715be1fce0 procfs: use more apprioriate types when dumping /proc/N/stat
- use int fpr priority and nice, since task_nice()/task_prio() return that

- field 24: get_mm_rss() returns unsigned long

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Alexey Dobriyan
af5e617143 proc: pass "fd" by value in /proc/*/{fd,fdinfo} code
Pass "fd" directly, not via pointer -- one less memory read.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Alexey Dobriyan
f05ed3f1ab proc: don't do dummy rcu_read_lock/rcu_read_unlock on error path
rcu_read_lock()/rcu_read_unlock() is nop for TINY_RCU, but is not a nop
for, say, PREEMPT_RCU.

proc_fill_cache() is called without RCU lock, there is no need to
lock/unlock on error path, simply jump out of the loop.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Cong Wang
2344bec788 proc: use mm_access() instead of ptrace_may_access()
mm_access() handles this much better, and avoids some race conditions.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Cong Wang
e7dcd9990e proc: remove mm_for_maps()
mm_for_maps() is a simple wrapper for mm_access(), and the name is
misleading, so just remove it and use mm_access() directly.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:28 -07:00
Cong Wang
b409e578d9 proc: clean up /proc/<pid>/environ handling
Similar to e268337dfe ("proc: clean up and fix /proc/<pid>/mem
handling"), move the check of permission to open(), this will simplify
read() code.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:28 -07:00
David Rientjes
a7f638f999 mm, oom: normalize oom scores to oom_score_adj scale only for userspace
The oom_score_adj scale ranges from -1000 to 1000 and represents the
proportion of memory available to the process at allocation time.  This
means an oom_score_adj value of 300, for example, will bias a process as
though it was using an extra 30.0% of available memory and a value of
-350 will discount 35.0% of available memory from its usage.

The oom killer badness heuristic also uses this scale to report the oom
score for each eligible process in determining the "best" process to
kill.  Thus, it can only differentiate each process's memory usage by
0.1% of system RAM.

On large systems, this can end up being a large amount of memory: 256MB
on 256GB systems, for example.

This can be fixed by having the badness heuristic to use the actual
memory usage in scoring threads and then normalizing it to the
oom_score_adj scale for userspace.  This results in better comparison
between eligible threads for kill and no change from the userspace
perspective.

Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:24 -07:00
Sasha Levin
08fa29d916 mm: fix NULL ptr deref when walking hugepages
A missing validation of the value returned by find_vma() could cause a
NULL ptr dereference when walking the pagetable.

This is triggerable from usermode by a simple user by trying to read a
page info out of /proc/pid/pagemap which doesn't exist.

Introduced by commit 025c5b2451 ("thp: optimize away unnecessary page
table locking").

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>		[3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:18 -07:00
Linus Torvalds
90324cc1b1 avoid iput() from flusher thread
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr
 uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP
 +AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB
 KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc
 ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS
 tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV
 4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F
 Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv
 5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63
 BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN
 NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF
 /9c4zxxekjHZnn2oooEa
 =oLXf
 -----END PGP SIGNATURE-----

Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback tree from Wu Fengguang:
 "Mainly from Jan Kara to avoid iput() in the flusher threads."

* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Avoid iput() from flusher thread
  vfs: Rename end_writeback() to clear_inode()
  vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
  writeback: Refactor writeback_single_inode()
  writeback: Remove wb->list_lock from writeback_single_inode()
  writeback: Separate inode requeueing after writeback
  writeback: Move I_DIRTY_PAGES handling
  writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
  writeback: Move clearing of I_SYNC into inode_sync_complete()
  writeback: initialize global_dirty_limit
  fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
  mm: page-writeback.c: local functions should not be exposed globally
2012-05-28 09:54:45 -07:00
Linus Torvalds
644473e9c6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace enhancements from Eric Biederman:
 "This is a course correction for the user namespace, so that we can
  reach an inexpensive, maintainable, and reasonably complete
  implementation.

  Highlights:
   - Config guards make it impossible to enable the user namespace and
     code that has not been converted to be user namespace safe.

   - Use of the new kuid_t type ensures the if you somehow get past the
     config guards the kernel will encounter type errors if you enable
     user namespaces and attempt to compile in code whose permission
     checks have not been updated to be user namespace safe.

   - All uids from child user namespaces are mapped into the initial
     user namespace before they are processed.  Removing the need to add
     an additional check to see if the user namespace of the compared
     uids remains the same.

   - With the user namespaces compiled out the performance is as good or
     better than it is today.

   - For most operations absolutely nothing changes performance or
     operationally with the user namespace enabled.

   - The worst case performance I could come up with was timing 1
     billion cache cold stat operations with the user namespace code
     enabled.  This went from 156s to 164s on my laptop (or 156ns to
     164ns per stat operation).

   - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
     Most uid/gid setting system calls treat these value specially
     anyway so attempting to use -1 as a uid would likely cause
     entertaining failures in userspace.

   - If setuid is called with a uid that can not be mapped setuid fails.
     I have looked at sendmail, login, ssh and every other program I
     could think of that would call setuid and they all check for and
     handle the case where setuid fails.

   - If stat or a similar system call is called from a context in which
     we can not map a uid we lie and return overflowuid.  The LFS
     experience suggests not lying and returning an error code might be
     better, but the historical precedent with uids is different and I
     can not think of anything that would break by lying about a uid we
     can't map.

   - Capabilities are localized to the current user namespace making it
     safe to give the initial user in a user namespace all capabilities.

  My git tree covers all of the modifications needed to convert the core
  kernel and enough changes to make a system bootable to runlevel 1."

Fix up trivial conflicts due to nearby independent changes in fs/stat.c

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
  userns:  Silence silly gcc warning.
  cred: use correct cred accessor with regards to rcu read lock
  userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
  userns: Convert cgroup permission checks to use uid_eq
  userns: Convert tmpfs to use kuid and kgid where appropriate
  userns: Convert sysfs to use kgid/kuid where appropriate
  userns: Convert sysctl permission checks to use kuid and kgids.
  userns: Convert proc to use kuid/kgid where appropriate
  userns: Convert ext4 to user kuid/kgid where appropriate
  userns: Convert ext3 to use kuid/kgid where appropriate
  userns: Convert ext2 to use kuid/kgid where appropriate.
  userns: Convert devpts to use kuid/kgid where appropriate
  userns: Convert binary formats to use kuid/kgid where appropriate
  userns: Add negative depends on entries to avoid building code that is userns unsafe
  userns: signal remove unnecessary map_cred_ns
  userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
  userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
  userns: Convert stat to return values mapped from kuids and kgids
  userns: Convert user specfied uids and gids in chown into kuids and kgid
  userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
  ...
2012-05-23 17:42:39 -07:00
Linus Torvalds
73f1f5dd3e Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc fixes from Andrew Morton.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (4 patches)
  frv: delete incorrect task prototypes causing compile fail
  slub: missing test for partial pages flush work in flush_all()
  fs, proc: fix ABBA deadlock in case of execution attempt of map_files/ entries
  drivers/rtc/rtc-pl031.c: configure correct wday for 2000-01-01
2012-05-18 15:56:25 -07:00
Linus Torvalds
30a08bf2d3 proc: move fd symlink i_mode calculations into tid_fd_revalidate()
Instead of doing the i_mode calculations at proc_fd_instantiate() time,
move them into tid_fd_revalidate(), which is where the other inode state
(notably uid/gid information) is updated too.

Otherwise we'll end up with stale i_mode information if an fd is re-used
while the dentry still hangs around.  Not that anything really *cares*
(symlink permissions don't really matter), but Tetsuo Handa noticed that
the owner read/write bits don't always match the state of the
readability of the file descriptor, and we _used_ to get this right a
long time ago in a galaxy far, far away.

Besides, aside from fixing an ugly detail (that has apparently been this
way since commit 61a2878402: "proc: Remove the hard coded inode
numbers" in 2006), this removes more lines of code than it adds.  And it
just makes sense to update i_mode in the same place we update i_uid/gid.

Al Viro correctly points out that we could just do the inode fill in the
inode iops ->getattr() function instead.  However, that does require
somewhat slightly more invasive changes, and adds yet *another* lookup
of the file descriptor.  We need to do the revalidate() for other
reasons anyway, and have the file descriptor handy, so we might as well
fill in the information at this point.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-18 14:06:17 -07:00
Cyrill Gorcunov
eb94cd96e0 fs, proc: fix ABBA deadlock in case of execution attempt of map_files/ entries
map_files/ entries are never supposed to be executed, still curious
minds might try to run them, which leads to the following deadlock

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.4.0-rc4-24406-g841e6a6 #121 Not tainted
  -------------------------------------------------------
  bash/1556 is trying to acquire lock:
   (&sb->s_type->i_mutex_key#8){+.+.+.}, at: do_lookup+0x267/0x2b1

  but task is already holding lock:
   (&sig->cred_guard_mutex){+.+.+.}, at: prepare_bprm_creds+0x2d/0x69

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&sig->cred_guard_mutex){+.+.+.}:
         validate_chain+0x444/0x4f4
         __lock_acquire+0x387/0x3f8
         lock_acquire+0x12b/0x158
         __mutex_lock_common+0x56/0x3a9
         mutex_lock_killable_nested+0x40/0x45
         lock_trace+0x24/0x59
         proc_map_files_lookup+0x5a/0x165
         __lookup_hash+0x52/0x73
         do_lookup+0x276/0x2b1
         walk_component+0x3d/0x114
         do_last+0xfc/0x540
         path_openat+0xd3/0x306
         do_filp_open+0x3d/0x89
         do_sys_open+0x74/0x106
         sys_open+0x21/0x23
         tracesys+0xdd/0xe2

  -> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}:
         check_prev_add+0x6a/0x1ef
         validate_chain+0x444/0x4f4
         __lock_acquire+0x387/0x3f8
         lock_acquire+0x12b/0x158
         __mutex_lock_common+0x56/0x3a9
         mutex_lock_nested+0x40/0x45
         do_lookup+0x267/0x2b1
         walk_component+0x3d/0x114
         link_path_walk+0x1f9/0x48f
         path_openat+0xb6/0x306
         do_filp_open+0x3d/0x89
         open_exec+0x25/0xa0
         do_execve_common+0xea/0x2f9
         do_execve+0x43/0x45
         sys_execve+0x43/0x5a
         stub_execve+0x6c/0xc0

This is because prepare_bprm_creds grabs task->signal->cred_guard_mutex
and when do_lookup happens we try to grab task->signal->cred_guard_mutex
again in lock_trace.

Fix it using plain ptrace_may_access() helper in proc_map_files_lookup()
and in proc_map_files_readdir() instead of lock_trace(), the caller must
be CAP_SYS_ADMIN granted anyway.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-17 18:00:51 -07:00
Eric W. Biederman
091bd3ea4e userns: Convert sysctl permission checks to use kuid and kgids.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15 14:59:28 -07:00
Eric W. Biederman
dcb0f22282 userns: Convert proc to use kuid/kgid where appropriate
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15 14:59:28 -07:00
Konstantin Khlebnikov
16fbdce62d proc/pid/pagemap: correctly report non-present ptes and holes between vmas
Reset the current pagemap-entry if the current pte isn't present, or if
current vma is over.  Otherwise pagemap reports last entry again and
again.

Non-present pte reporting was broken in commit 092b50bacd ("pagemap:
introduce data structure for pagemap entry")

Reporting for holes was broken in commit 5aaabe831e ("pagemap: avoid
splitting thp when reading /proc/pid/pagemap")

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Jan Kara
dbd5768f87 vfs: Rename end_writeback() to clear_inode()
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:41 +08:00
Eric W. Biederman
ae2975bc34 userns: Convert group_info values from gid_t to kgid_t.
As a first step to converting struct cred to be all kuid_t and kgid_t
values convert the group values stored in group_info to always be
kgid_t values.   Unless user namespaces are used this change should
have no effect.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03 03:27:21 -07:00
Eric W. Biederman
22d917d80e userns: Rework the user_namespace adding uid/gid mapping support
- Convert the old uid mapping functions into compatibility wrappers
- Add a uid/gid mapping layer from user space uid and gids to kernel
  internal uids and gids that is extent based for simplicty and speed.
  * Working with number space after mapping uids/gids into their kernel
    internal version adds only mapping complexity over what we have today,
    leaving the kernel code easy to understand and test.
- Add proc files /proc/self/uid_map /proc/self/gid_map
  These files display the mapping and allow a mapping to be added
  if a mapping does not exist.
- Allow entering the user namespace without a uid or gid mapping.
  Since we are starting with an existing user our uids and gids
  still have global mappings so are still valid and useful they just don't
  have local mappings.  The requirement for things to work are global uid
  and gid so it is odd but perfectly fine not to have a local uid
  and gid mapping.
  Not requiring global uid and gid mappings greatly simplifies
  the logic of setting up the uid and gid mappings by allowing
  the mappings to be set after the namespace is created which makes the
  slight weirdness worth it.
- Make the mappings in the initial user namespace to the global
  uid/gid space explicit.  Today it is an identity mapping
  but in the future we may want to twist this for debugging, similar
  to what we do with jiffies.
- Document the memory ordering requirements of setting the uid and
  gid mappings.  We only allow the mappings to be set once
  and there are no pointers involved so the requirments are
  trivial but a little atypical.

Performance:

In this scheme for the permission checks the performance is expected to
stay the same as the actuall machine instructions should remain the same.

The worst case I could think of is ls -l on a large directory where
all of the stat results need to be translated with from kuids and
kgids to uids and gids.  So I benchmarked that case on my laptop
with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.

My benchmark consisted of going to single user mode where nothing else
was running. On an ext4 filesystem opening 1,000,000 files and looping
through all of the files 1000 times and calling fstat on the
individuals files.  This was to ensure I was benchmarking stat times
where the inodes were in the kernels cache, but the inode values were
not in the processors cache.  My results:

v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

All of the configurations ran in roughly 120ns when I performed tests
that ran in the cpu cache.

So in summary the performance impact is:
1ns improvement in the worst case with user namespace support compiled out.
8ns aka 5% slowdown in the worst case with user namespace support compiled in.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-26 02:01:39 -07:00
Will Deacon
63f61a6f46 revert "proc: clear_refs: do not clear reserved pages"
Revert commit 85e72aa538 ("proc: clear_refs: do not clear reserved
pages"), which was a quick fix suitable for -stable until ARM had been
moved over to the gate_vma mechanism:

https://lkml.org/lkml/2012/1/14/55

With commit f9d4861f ("ARM: 7294/1: vectors: use gate_vma for vectors user
mapping"), ARM does now use the gate_vma, so the PageReserved check can be
removed from the proc code.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Nicolas Pitre <nico@linaro.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-25 21:26:34 -07:00
Linus Torvalds
ccb1ec95e9 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "The itimer removal one is not strictly a fix, but I really wanted to
  avoid a rebase of the urgent ones."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "clocksource: Load the ACPI PM clocksource asynchronously"
  clockevents: tTack broadcast device mode change in tick_broadcast_switch_to_oneshot()
  itimer: Use printk_once instead of WARN_ONCE
  nohz: Fix stale jiffies update in tick_nohz_restart()
  tick: Document TICK_ONESHOT config option
  proc: stats: Use arch_idle_time for idle and iowait times if available
  itimer: Schedule silent NULL pointer fixup in setitimer() for removal
2012-04-12 15:16:26 -07:00
Linus Torvalds
5d32c88f0b Merge branch 'akpm' (Andrew's patch-bomb)
Merge batch of fixes from Andrew Morton:
 "The simple_open() cleanup was held back while I wanted for laggards to
  merge things.

  I still need to send a few checkpoint/restore patches.  I've been
  wobbly about merging them because I'm wobbly about the overall
  prospects for success of the project.  But after speaking with Pavel
  at the LSF conference, it sounds like they're further toward
  completion than I feared - apparently davem is at the "has stopped
  complaining" stage regarding the net changes.  So I need to go back
  and re-review those patchs and their (lengthy) discussion."

* emailed from Andrew Morton <akpm@linux-foundation.org>: (16 patches)
  memcg swap: use mem_cgroup_uncharge_swap fix
  backlight: add driver for DA9052/53 PMIC v1
  C6X: use set_current_blocked() and block_sigmask()
  MAINTAINERS: add entry for sparse checker
  MAINTAINERS: fix REMOTEPROC F: typo
  alpha: use set_current_blocked() and block_sigmask()
  simple_open: automatically convert to simple_open()
  scripts/coccinelle/api/simple_open.cocci: semantic patch for simple_open()
  libfs: add simple_open()
  hugetlbfs: remove unregister_filesystem() when initializing module
  drivers/rtc/rtc-88pm860x.c: fix rtc irq enable callback
  fs/xattr.c:setxattr(): improve handling of allocation failures
  fs/xattr.c:listxattr(): fall back to vmalloc() if kmalloc() failed
  fs/xattr.c: suppress page allocation failure warnings from sys_listxattr()
  sysrq: use SEND_SIG_FORCED instead of force_sig()
  proc: fix mount -t proc -o AAA
2012-04-05 15:30:34 -07:00
Vasiliy Kulikov
99663be772 proc: fix mount -t proc -o AAA
The proc_parse_options() call from proc_mount() runs only once at boot
time.  So on any later mount attempt, any mount options are ignored
because ->s_root is already initialized.

As a consequence, "mount -o <options>" will ignore the options.  The
only way to change mount options is "mount -o remount,<options>".

To fix this, parse the mount options unconditionally.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Tested-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-05 15:25:50 -07:00
Martin Schwidefsky
cb85a6ed67 proc: stats: Use arch_idle_time for idle and iowait times if available
Git commit a25cac5198 "proc: Consider NO_HZ when printing idle and
iowait times" changes the code for /proc/stat to use get_cpu_idle_time_us
and get_cpu_iowait_time_us if the system is running with nohz enabled.
For architectures which define arch_idle_time (currently s390 only)
this is a change for the worse. The result of arch_idle_time is supposed
to be the exact sleep time of the target cpu and should be used instead
of the value kept by the scheduler.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120330122308.18720283@de.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-30 15:43:33 +02:00
Linus Torvalds
a591afc01d Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x32 support for x86-64 from Ingo Molnar:
 "This tree introduces the X32 binary format and execution mode for x86:
  32-bit data space binaries using 64-bit instructions and 64-bit kernel
  syscalls.

  This allows applications whose working set fits into a 32 bits address
  space to make use of 64-bit instructions while using a 32-bit address
  space with shorter pointers, more compressed data structures, etc."

Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  x32: Fix alignment fail in struct compat_siginfo
  x32: Fix stupid ia32/x32 inversion in the siginfo format
  x32: Add ptrace for x32
  x32: Switch to a 64-bit clock_t
  x32: Provide separate is_ia32_task() and is_x32_task() predicates
  x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
  x86/x32: Fix the binutils auto-detect
  x32: Warn and disable rather than error if binutils too old
  x32: Only clear TIF_X32 flag once
  x32: Make sure TS_COMPAT is cleared for x32 tasks
  fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
  fs: Fix close_on_exec pointer in alloc_fdtable
  x32: Drop non-__vdso weak symbols from the x32 VDSO
  x32: Fix coding style violations in the x32 VDSO code
  x32: Add x32 VDSO support
  x32: Allow x32 to be configured
  x32: If configured, add x32 system calls to system call tables
  x32: Handle process creation
  x32: Signal-related system calls
  x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
  ...
2012-03-29 18:12:23 -07:00
Linus Torvalds
18a06efae5 Merge branch 'akpm' (Andrew's patch-bomb)
Single fix for a commit from the first batch of patches through Andrew.

* emailed from Andrew Morton <akpm@linux-foundation.org>:
  pagemap: remove remaining unneeded spin_lock()
2012-03-29 14:07:08 -07:00
Naoya Horiguchi
10bdfb5ef7 pagemap: remove remaining unneeded spin_lock()
Commit 025c5b2451 ("thp: optimize away unnecessary page table
locking") moves spin_lock() into pmd_trans_huge_lock() in order to avoid
locking unless pmd is for thp.  So this spin_lock() is a bug.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-29 14:06:43 -07:00
Linus Torvalds
532bfc851a Merge branch 'akpm' (Andrew's patch-bomb)
Merge third batch of patches from Andrew Morton:
 - Some MM stragglers
 - core SMP library cleanups (on_each_cpu_mask)
 - Some IPI optimisations
 - kexec
 - kdump
 - IPMI
 - the radix-tree iterator work
 - various other misc bits.

 "That'll do for -rc1.  I still have ~10 patches for 3.4, will send
  those along when they've baked a little more."

* emailed from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  backlight: fix typo in tosa_lcd.c
  crc32: add help text for the algorithm select option
  mm: move hugepage test examples to tools/testing/selftests/vm
  mm: move slabinfo.c to tools/vm
  mm: move page-types.c from Documentation to tools/vm
  selftests/Makefile: make `run_tests' depend on `all'
  selftests: launch individual selftests from the main Makefile
  radix-tree: use iterators in find_get_pages* functions
  radix-tree: rewrite gang lookup using iterator
  radix-tree: introduce bit-optimized iterator
  fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
  nbd: rename the nbd_device variable from lo to nbd
  pidns: add reboot_pid_ns() to handle the reboot syscall
  sysctl: use bitmap library functions
  ipmi: use locks on watchdog timeout set on reboot
  ipmi: simplify locking
  ipmi: fix message handling during panics
  ipmi: use a tasklet for handling received messages
  ipmi: increase KCS timeouts
  ipmi: decrease the IPMI message transaction time in interrupt mode
  ...
2012-03-28 17:19:28 -07:00
Andrew Morton
4c619aa0ba fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
If CONFIG_NET_NS, CONFIG_UTS_NS and CONFIG_IPC_NS are disabled,
ns_entries[] becomes empty and things like
ns_entries[ARRAY_SIZE(ns_entries) - 1] will explode.

Reported-by: Richard Weinberger <richard@nod.at>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:37 -07:00
Andrea Arcangeli
45f83cefe3 mm: thp: fix up pmd_trans_unstable() locations
pmd_trans_unstable() should be called before pmd_offset_map() in the
locations where the mmap_sem is held for reading.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:35 -07:00
KAMEZAWA Hiroyuki
3748b2f15b procfs: fix /proc/statm
bda7bad62b ("procfs: speed up /proc/pid/stat, statm") broke /proc/statm
- 'text' is printed twice by mistake.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reported-by: Ulrich Drepper <drepper@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:35 -07:00
Linus Torvalds
0195c00244 Disintegrate and delete asm/system.h
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
 8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
 u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
 ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
 rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
 eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
 HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
 /5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
 Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
 4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
 FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
 NypQthI85pc=
 =G9mT
 -----END PGP SIGNATURE-----

Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system

Pull "Disintegrate and delete asm/system.h" from David Howells:
 "Here are a bunch of patches to disintegrate asm/system.h into a set of
  separate bits to relieve the problem of circular inclusion
  dependencies.

  I've built all the working defconfigs from all the arches that I can
  and made sure that they don't break.

  The reason for these patches is that I recently encountered a circular
  dependency problem that came about when I produced some patches to
  optimise get_order() by rewriting it to use ilog2().

  This uses bitops - and on the SH arch asm/bitops.h drags in
  asm-generic/get_order.h by a circuituous route involving asm/system.h.

  The main difficulty seems to be asm/system.h.  It holds a number of
  low level bits with no/few dependencies that are commonly used (eg.
  memory barriers) and a number of bits with more dependencies that
  aren't used in many places (eg.  switch_to()).

  These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

        Move memory barriers here.  This already done for MIPS and Alpha.

    (2) asm/switch_to.h

        Move switch_to() and related stuff here.

    (3) asm/exec.h

        Move arch_align_stack() here.  Other process execution related bits
        could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

        Move xchg() and cmpxchg() here as they're full word atomic ops and
        frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

        Move die() and related bits.

    (6) asm/auxvec.h

        Move AT_VECTOR_SIZE_ARCH here.

  Other arch headers are created as needed on a per-arch basis."

Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that.  We'll find out anything that got broken and fix it..

* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
  Delete all instances of asm/system.h
  Remove all #inclusions of asm/system.h
  Add #includes needed to permit the removal of asm/system.h
  Move all declarations of free_initmem() to linux/mm.h
  Disintegrate asm/system.h for OpenRISC
  Split arch_align_stack() out from asm-generic/system.h
  Split the switch_to() wrapper out of asm-generic/system.h
  Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
  Create asm-generic/barrier.h
  Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
  Disintegrate asm/system.h for Xtensa
  Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
  Disintegrate asm/system.h for Tile
  Disintegrate asm/system.h for Sparc
  Disintegrate asm/system.h for SH
  Disintegrate asm/system.h for Score
  Disintegrate asm/system.h for S390
  Disintegrate asm/system.h for PowerPC
  Disintegrate asm/system.h for PA-RISC
  Disintegrate asm/system.h for MN10300
  ...
2012-03-28 15:58:21 -07:00
David Howells
9ffc93f203 Remove all #inclusions of asm/system.h
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it.  Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28 18:30:03 +01:00
Linus Torvalds
f1d38e423a Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl
Pull sysctl updates from Eric Biederman:

 - Rewrite of sysctl for speed and clarity.

   Insert/remove/Lookup in sysctl are all now O(NlogN) operations, and
   are no longer bottlenecks in the process of adding and removing
   network devices.

   sysctl is now focused on being a filesystem instead of system call
   and the code can all be found in fs/proc/proc_sysctl.c.  Hopefully
   this means the code is now approachable.

   Much thanks is owed to Lucian Grinjincu for keeping at this until
   something was found that was usable.

 - The recent proc_sys_poll oops found by the fuzzer during hibernation
   is fixed.

* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl: (36 commits)
  sysctl: protect poll() in entries that may go away
  sysctl: Don't call sysctl_follow_link unless we are a link.
  sysctl: Comments to make the code clearer.
  sysctl: Correct error return from get_subdir
  sysctl: An easier to read version of find_subdir
  sysctl: fix memset parameters in setup_sysctl_set()
  sysctl: remove an unused variable
  sysctl: Add register_sysctl for normal sysctl users
  sysctl: Index sysctl directories with rbtrees.
  sysctl: Make the header lists per directory.
  sysctl: Move sysctl_check_dups into insert_header
  sysctl: Modify __register_sysctl_paths to take a set instead of a root and an nsproxy
  sysctl: Replace root_list with links between sysctl_table_sets.
  sysctl: Add sysctl_print_dir and use it in get_subdir
  sysctl: Stop requiring explicit management of sysctl directories
  sysctl: Add a root pointer to ctl_table_set
  sysctl: Rewrite proc_sys_readdir in terms of first_entry and next_entry
  sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry.
  sysctl: Normalize the root_table data structure.
  sysctl: Factor out insert_header and erase_header
  ...
2012-03-23 18:08:58 -07:00
Pravin B Shelar
1b26c9b334 proc-ns: use d_set_d_op() API to set dentry ops in proc_ns_instantiate().
The namespace cleanup path leaks a dentry which holds a reference count
on a network namespace.  Keeping that network namespace from being freed
when the last user goes away.  Leaving things like vlan devices in the
leaked network namespace.

If you use ip netns add for much real work this problem becomes apparent
pretty quickly.  It light testing the problem hides because frequently
you simply don't notice the leak.

Use d_set_d_op() so that DCACHE_OP_* flags are set correctly.

This issue exists back to 3.0.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:42 -07:00
KAMEZAWA Hiroyuki
bda7bad62b procfs: speed up /proc/pid/stat, statm
Process accounting applications as top, ps visit some files under
/proc/<pid>.  With seq_put_decimal_ull(), we can optimize /proc/<pid>/stat
and /proc/<pid>/statm files.

This patch adds
  - seq_put_decimal_ll() for signed values.
  - allow delimiter == 0.
  - convert seq_printf() to seq_put_decimal_ull/ll in /proc/stat, statm.

Test result on a system with 2000+ procs.

Before patch:
  [kamezawa@bluextal test]$ top -b -n 1 | wc -l
  2223
  [kamezawa@bluextal test]$ time top -b -n 1 > /dev/null

  real    0m0.675s
  user    0m0.044s
  sys     0m0.121s

  [kamezawa@bluextal test]$ time ps -elf > /dev/null

  real    0m0.236s
  user    0m0.056s
  sys     0m0.176s

After patch:
  kamezawa@bluextal ~]$ time top -b -n 1 > /dev/null

  real    0m0.657s
  user    0m0.052s
  sys     0m0.100s

  [kamezawa@bluextal ~]$ time ps -elf > /dev/null

  real    0m0.198s
  user    0m0.050s
  sys     0m0.145s

Considering top, ps tend to scan /proc periodically, this will reduce cpu
consumption by top/ps to some extent.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:42 -07:00
KAMEZAWA Hiroyuki
1ac101a5d6 procfs: add num_to_str() to speed up /proc/stat
== stat_check.py
num = 0
with open("/proc/stat") as f:
        while num < 1000 :
                data = f.read()
                f.seek(0, 0)
                num = num + 1
==

perf shows

    20.39%  stat_check.py  [kernel.kallsyms]    [k] format_decode
    13.41%  stat_check.py  [kernel.kallsyms]    [k] number
    12.61%  stat_check.py  [kernel.kallsyms]    [k] vsnprintf
    10.85%  stat_check.py  [kernel.kallsyms]    [k] memcpy
     4.85%  stat_check.py  [kernel.kallsyms]    [k] radix_tree_lookup
     4.43%  stat_check.py  [kernel.kallsyms]    [k] seq_printf

This patch removes most of calls to vsnprintf() by adding num_to_str()
and seq_print_decimal_ull(), which prints decimal numbers without rich
functions provided by printf().

On my 8cpu box.
== Before patch ==
[root@bluextal test]# time ./stat_check.py

real    0m0.150s
user    0m0.026s
sys     0m0.121s

== After patch ==
[root@bluextal test]# time ./stat_check.py

real    0m0.055s
user    0m0.022s
sys     0m0.030s

[akpm@linux-foundation.org: remove incorrect comment, use less statck in num_to_str(), move comment from .h to .c, simplify seq_put_decimal_ull()]
[andrea@betterlinux.com: avoid breaking the ABI in /proc/stat]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Turner <pjt@google.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:42 -07:00
Eric Dumazet
59a32e2ce5 proc: speed up /proc/stat handling
On a typical 16 cpus machine, "cat /proc/stat" gives more than 4096 bytes,
and is slow :

  # strace -T -o /tmp/STRACE cat /proc/stat | wc -c
  5826
  # grep "cpu " /tmp/STRACE
  read(0, "cpu  1949310 19 2144714 12117253"..., 32768) = 5826 <0.001504>

Thats partly because show_stat() must be called twice since initial
buffer size is too small (4096 bytes for less than 32 possible cpus)

Fix this by :

 1) Taking into account nr_irqs in the initial buffer sizing.

 2) Using ksize() to allow better filling of initial buffer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:42 -07:00
Djalal Harouni
b908243c54 fs/proc/kcore.c: make get_sparsemem_vmemmap_info() static
get_sparsemem_vmemmap_info() is only used inside fs/proc/kcore.c

Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:42 -07:00
Lucas De Marchi
4e474a00d7 sysctl: protect poll() in entries that may go away
Protect code accessing ctl_table by grabbing the header with grab_header()
and after releasing with sysctl_head_finish().  This is needed if poll()
is called in entries created by modules: currently only hostname and
domainname support poll(), but this bug may be triggered when/if modules
use it and if user called poll() in a file that doesn't support it.

Dave Jones reported the following when using a syscall fuzzer while
hibernating/resuming:

RIP: 0010:[<ffffffff81233e3e>]  [<ffffffff81233e3e>] proc_sys_poll+0x4e/0x90
RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
[ ... ]
Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83
7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16
b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89

If an entry goes away while we are polling() it, ctl_table may not exist
anymore.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-03-22 14:46:56 -07:00
Linus Torvalds
95211279c5 Merge branch 'akpm' (Andrew's patch-bomb)
Merge first batch of patches from Andrew Morton:
 "A few misc things and all the MM queue"

* emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits)
  memcg: avoid THP split in task migration
  thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
  memcg: clean up existing move charge code
  mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
  mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
  mm/memcontrol.c: s/stealed/stolen/
  memcg: fix performance of mem_cgroup_begin_update_page_stat()
  memcg: remove PCG_FILE_MAPPED
  memcg: use new logic for page stat accounting
  memcg: remove PCG_MOVE_LOCK flag from page_cgroup
  memcg: simplify move_account() check
  memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
  memcg: kill dead prev_priority stubs
  memcg: remove PCG_CACHE page_cgroup flag
  memcg: let css_get_next() rely upon rcu_read_lock()
  cgroup: revert ss_id_lock to spinlock
  idr: make idr_get_next() good for rcu_read_lock()
  memcg: remove unnecessary thp check in page stat accounting
  memcg: remove redundant returns
  memcg: enum lru_list lru
  ...
2012-03-22 09:04:48 -07:00
Linus Torvalds
5375871d43 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc merge from Benjamin Herrenschmidt:
 "Here's the powerpc batch for this merge window.  It is going to be a
  bit more nasty than usual as in touching things outside of
  arch/powerpc mostly due to the big iSeriesectomy :-) We finally got
  rid of the bugger (legacy iSeries support) which was a PITA to
  maintain and that nobody really used anymore.

  Here are some of the highlights:

   - Legacy iSeries is gone.  Thanks Stephen ! There's still some bits
     and pieces remaining if you do a grep -ir series arch/powerpc but
     they are harmless and will be removed in the next few weeks
     hopefully.

   - The 'fadump' functionality (Firmware Assisted Dump) replaces the
     previous (equivalent) "pHyp assisted dump"...  it's a rewrite of a
     mechanism to get the hypervisor to do crash dumps on pSeries, the
     new implementation hopefully being much more reliable.  Thanks
     Mahesh Salgaonkar.

   - The "EEH" code (pSeries PCI error handling & recovery) got a big
     spring cleaning, motivated by the need to be able to implement a
     new backend for it on top of some new different type of firwmare.

     The work isn't complete yet, but a good chunk of the cleanups is
     there.  Note that this adds a field to struct device_node which is
     not very nice and which Grant objects to.  I will have a patch soon
     that moves that to a powerpc private data structure (hopefully
     before rc1) and we'll improve things further later on (hopefully
     getting rid of the need for that pointer completely).  Thanks Gavin
     Shan.

   - I dug into our exception & interrupt handling code to improve the
     way we do lazy interrupt handling (and make it work properly with
     "edge" triggered interrupt sources), and while at it found & fixed
     a wagon of issues in those areas, including adding support for page
     fault retry & fatal signals on page faults.

   - Your usual random batch of small fixes & updates, including a bunch
     of new embedded boards, both Freescale and APM based ones, etc..."

I fixed up some conflicts with the generalized irq-domain changes from
Grant Likely, hopefully correctly.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits)
  powerpc/ps3: Do not adjust the wrapper load address
  powerpc: Remove the rest of the legacy iSeries include files
  powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces
  init: Remove CONFIG_PPC_ISERIES
  powerpc: Remove FW_FEATURE ISERIES from arch code
  tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable
  powerpc/spufs: Fix double unlocks
  powerpc/5200: convert mpc5200 to use of_platform_populate()
  powerpc/mpc5200: add options to mpc5200_defconfig
  powerpc/mpc52xx: add a4m072 board support
  powerpc/mpc5200: update mpc5200_defconfig to fit for charon board
  Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup
  powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board
  powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board
  MAINTAINERS: Update PowerPC 4xx tree
  powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board
  powerpc: document the FSL MPIC message register binding
  powerpc: add support for MPIC message register API
  powerpc/fsl: Added aliased MSIIR register address to MSI node in dts
  powerpc/85xx: mpc8548cds - add 36-bit dts
  ...
2012-03-21 18:55:10 -07:00
Siddhesh Poyarekar
b76437579d procfs: mark thread stack correctly in proc/<pid>/maps
Stack for a new thread is mapped by userspace code and passed via
sys_clone.  This memory is currently seen as anonymous in
/proc/<pid>/maps, which makes it difficult to ascertain which mappings
are being used for thread stacks.  This patch uses the individual task
stack pointers to determine which vmas are actually thread stacks.

For a multithreaded program like the following:

	#include <pthread.h>

	void *thread_main(void *foo)
	{
		while(1);
	}

	int main()
	{
		pthread_t t;
		pthread_create(&t, NULL, thread_main, NULL);
		pthread_join(t, NULL);
	}

proc/PID/maps looks like the following:

    00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
    7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
    7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
    7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
    7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
    7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
    7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
    7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but
that is not always a reliable way to find out which vma is a thread
stack.  Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same
content.

With this patch in place, /proc/PID/task/TID/maps are treated as 'maps
as the task would see it' and hence, only the vma that that task uses as
stack is marked as [stack].  All other 'stack' vmas are marked as
anonymous memory.  /proc/PID/maps acts as a thread group level view,
where all thread stack vmas are marked as [stack:TID] where TID is the
process ID of the task that uses that vma as stack, while the process
stack is marked as [stack].

So /proc/PID/maps will look like this:

    00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
    7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
    7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
    7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
    7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
    7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
    7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
    7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Thus marking all vmas that are used as stacks by the threads in the
thread group along with the process stack.  The task level maps will
however like this:

    00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
    7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
    7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
    7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
    7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
    7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
    7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
    7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

where only the vma that is being used as a stack by *that* task is
marked as [stack].

Analogous changes have been made to /proc/PID/smaps,
/proc/PID/numa_maps, /proc/PID/task/TID/smaps and
/proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
numa_maps:

    [siddhesh@localhost ~ ]$ pgrep a.out
    1441
    [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
    7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
    7fff6273a000 default stack anon=3 dirty=3 N0=3
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
    7f8a44492000 default stack anon=2 dirty=2 N0=2
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
    7fff6273a000 default stack anon=3 dirty=3 N0=3

[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:58 -07:00
Naoya Horiguchi
092b50bacd pagemap: introduce data structure for pagemap entry
Currently a local variable of pagemap entry in pagemap_pte_range() is
named pfn and typed with u64, but it's not correct (pfn should be unsigned
long.)

This patch introduces special type for pagemap entries and replaces code
with it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:57 -07:00
Naoya Horiguchi
e873c49fbf pagemap: export KPF_THP
This flag shows that a given page is a subpage of a transparent hugepage.
It helps us debug and test the kernel by showing physical address of thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:57 -07:00
Naoya Horiguchi
025c5b2451 thp: optimize away unnecessary page table locking
Currently when we check if we can handle thp as it is or we need to split
it into regular sized pages, we hold page table lock prior to check
whether a given pmd is mapping thp or not.  Because of this, when it's not
"huge pmd" we suffer from unnecessary lock/unlock overhead.  To remove it,
this patch introduces a optimized check function and replace several
similar logics with it.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:57 -07:00
Naoya Horiguchi
5aaabe831e pagemap: avoid splitting thp when reading /proc/pid/pagemap
Thp split is not necessary if we explicitly check whether pmds are mapping
thps or not.  This patch introduces this check and adds code to generate
pagemap entries for pmds mapping thps, which results in less performance
impact of pagemap on thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:56 -07:00
Andrea Arcangeli
1a5a9906d4 mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode.  In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds).  The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously.  This is
probably why it wasn't common to run into this.  For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).

The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value.  Even if the real pmd is changing under the
value we hold on the stack, we don't care.  If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).

All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd.  The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds).  I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).

		if (pmd_trans_huge(*pmd)) {
			if (next-addr != HPAGE_PMD_SIZE) {
				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
				split_huge_page_pmd(vma->vm_mm, pmd);
			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
				continue;
			/* fall through */
		}
		if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.

====== start quote =======
      mapcount 0 page_mapcount 1
      kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

      mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
    ->  145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

               virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
      on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
             madvise_dontneed
               zap_page_range
                 unmap_vmas
                   unmap_page_range
                     zap_pud_range
                       zap_pmd_range
                         //
                         // Assume that this huge page has never been accessed.
                         // I.e. content of the PMD entry is zero (not mapped).
                         //
                         if (pmd_trans_huge(*pmd)) {
                             // We don't get here due to the above assumption.
                         }
                         //
                         // Assume that Thread B incurred a page fault and
             .---------> // sneaks in here as shown below.
             |           //
             |           if (pmd_none_or_clear_bad(pmd))
             |               {
             |                 if (unlikely(pmd_bad(*pmd)))
             |                     pmd_clear_bad
             |                     {
             |                       pmd_ERROR
             |                         // Log "bad pmd ..." message here.
             |                       pmd_clear
             |                         // Clear the page's PMD entry.
             |                         // Thread B incremented the map count
             |                         // in page_add_new_anon_rmap(), but
             |                         // now the page is no longer mapped
             |                         // by a PMD entry (-> inconsistency).
             |                     }
             |               }
             |
             v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
      in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
              // We get here due to the above assumption (PMD entry is zero).
              do_huge_pmd_anonymous_page
                alloc_hugepage_vma
                  // Allocate a new transparent huge page here.
                ...
                __do_huge_pmd_anonymous_page
                  ...
                  spin_lock(&mm->page_table_lock)
                  ...
                  page_add_new_anon_rmap
                    // Here we increment the page's map count (starts at -1).
                    atomic_set(&page->_mapcount, 0)
                  set_pmd_at
                    // Here we set the page's PMD entry which will be cleared
                    // when Thread A calls pmd_clear_bad().
                  ...
                  spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read).  Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated.  However, Thread A
    does not synchronize on that lock.

====== end quote =======

[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>		[2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:54 -07:00
Linus Torvalds
e2a0883e40 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
 "This is _not_ all; in particular, Miklos' and Jan's stuff is not there
  yet."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
  ext4: initialization of ext4_li_mtx needs to be done earlier
  debugfs-related mode_t whack-a-mole
  hfsplus: add an ioctl to bless files
  hfsplus: change finder_info to u32
  hfsplus: initialise userflags
  qnx4: new helper - try_extent()
  qnx4: get rid of qnx4_bread/qnx4_getblk
  take removal of PF_FORKNOEXEC to flush_old_exec()
  trim includes in inode.c
  um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
  um: embed ->stub_pages[] into mmu_context
  gadgetfs: list_for_each_safe() misuse
  ocfs2: fix leaks on failure exits in module_init
  ecryptfs: make register_filesystem() the last potential failure exit
  ntfs: forgets to unregister sysctls on register_filesystem() failure
  logfs: missing cleanup on register_filesystem() failure
  jfs: mising cleanup on register_filesystem() failure
  make configfs_pin_fs() return root dentry on success
  configfs: configfs_create_dir() has parent dentry in dentry->d_parent
  configfs: sanitize configfs_create()
  ...
2012-03-21 13:36:41 -07:00
Linus Torvalds
3556485f15 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates for 3.4 from James Morris:
 "The main addition here is the new Yama security module from Kees Cook,
  which was discussed at the Linux Security Summit last year.  Its
  purpose is to collect miscellaneous DAC security enhancements in one
  place.  This also marks a departure in policy for LSM modules, which
  were previously limited to being standalone access control systems.
  Chromium OS is using Yama, and I believe there are plans for Ubuntu,
  at least.

  This patchset also includes maintenance updates for AppArmor, TOMOYO
  and others."

Fix trivial conflict in <net/sock.h> due to the jumo_label->static_key
rename.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
  AppArmor: Fix location of const qualifier on generated string tables
  TOMOYO: Return error if fails to delete a domain
  AppArmor: add const qualifiers to string arrays
  AppArmor: Add ability to load extended policy
  TOMOYO: Return appropriate value to poll().
  AppArmor: Move path failure information into aa_get_name and rename
  AppArmor: Update dfa matching routines.
  AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
  AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
  AppArmor: Add const qualifiers to generated string tables
  AppArmor: Fix oops in policy unpack auditing
  AppArmor: Fix error returned when a path lookup is disconnected
  KEYS: testing wrong bit for KEY_FLAG_REVOKED
  TOMOYO: Fix mount flags checking order.
  security: fix ima kconfig warning
  AppArmor: Fix the error case for chroot relative path name lookup
  AppArmor: fix mapping of META_READ to audit and quiet flags
  AppArmor: Fix underflow in xindex calculation
  AppArmor: Fix dropping of allowed operations that are force audited
  AppArmor: Add mising end of structure test to caps unpacking
  ...
2012-03-21 13:25:04 -07:00
Linus Torvalds
69a7aebcf0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "It's indeed trivial -- mostly documentation updates and a bunch of
  typo fixes from Masanari.

  There are also several linux/version.h include removals from Jesper."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
  kcore: fix spelling in read_kcore() comment
  constify struct pci_dev * in obvious cases
  Revert "char: Fix typo in viotape.c"
  init: fix wording error in mm_init comment
  usb: gadget: Kconfig: fix typo for 'different'
  Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
  writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
  writeback: fix typo in the writeback_control comment
  Documentation: Fix multiple typo in Documentation
  tpm_tis: fix tis_lock with respect to RCU
  Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
  Doc: Update numastat.txt
  qla4xxx: Add missing spaces to error messages
  compiler.h: Fix typo
  security: struct security_operations kerneldoc fix
  Documentation: broken URL in libata.tmpl
  Documentation: broken URL in filesystems.tmpl
  mtd: simplify return logic in do_map_probe()
  mm: fix comment typo of truncate_inode_pages_range
  power: bq27x00: Fix typos in comment
  ...
2012-03-20 21:12:50 -07:00
Al Viro
48fde701af switch open-coded instances of d_make_root() to new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:35 -04:00
Al Viro
6b4231e2f9 procfs: clean proc_fill_super() up
First of all, there's no need to zero ->i_uid/->i_gid on root inode -
both had been set to zero already.  Moreover, let's take the iput()
on failure to the failure exit it belongs to...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:34 -04:00
Laura Vasilescu
f1f996b66c kcore: fix spelling in read_kcore() comment
Signed-off-by: Laura Vasilescu <laura@rosedu.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-03-20 12:24:10 +01:00
Hiroshi Shimamoto
2e5b5b3a1b sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
Pass nice as a value to proc_sched_autogroup_set_nice().

No side effect is expected, and the variable err will be overwritten with
the return value.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F45FBB7.5090607@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-02 12:23:49 +01:00
Mahesh Salgaonkar
1625739376 fadump: Introduce cleanup routine to invalidate /proc/vmcore.
With the firmware-assisted dump support we don't require a reboot when we
are in second kernel after crash. The second kernel after crash is a normal
kernel boot and has knowledge about entire system RAM with the page tables
initialized for entire system RAM. Hence once the dump is saved to disk, we
can just release the reserved memory area for general use and continue
with second kernel as production kernel.

Hence when we release the reserved memory that contains dump data, the
'/proc/vmcore' will not be valid anymore. Hence this patch introduces
a cleanup routine that invalidates and removes the /proc/vmcore file. This
routine will be invoked before we release the reserved dump memory area.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-02-23 10:50:02 +11:00
David Howells
1dce27c5aa Wrap accesses to the fd_sets in struct fdtable
Wrap accesses to the fd_sets in struct fdtable (for recording open files and
close-on-exec flags) so that we can move away from using fd_sets since we
abuse the fd_set structs by not allocating the full-sized structure under
normal circumstances and by non-core code looking at the internals of the
fd_sets.

The first abuse means that use of FD_ZERO() on these fd_sets is not permitted,
since that cannot be told about their abnormal lengths.

This introduces six wrapper functions for setting, clearing and testing
close-on-exec flags and fd-is-open flags:

	void __set_close_on_exec(int fd, struct fdtable *fdt);
	void __clear_close_on_exec(int fd, struct fdtable *fdt);
	bool close_on_exec(int fd, const struct fdtable *fdt);
	void __set_open_fd(int fd, struct fdtable *fdt);
	void __clear_open_fd(int fd, struct fdtable *fdt);
	bool fd_is_open(int fd, const struct fdtable *fdt);

Note that I've prepended '__' to the names of the set/clear functions because
they require the caller to hold a lock to use them.

Note also that I haven't added wrappers for looking behind the scenes at the
the array.  Possibly that should exist too.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20120216174942.23314.1364.stgit@warthog.procyon.org.uk
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2012-02-19 10:30:52 -08:00
Al Viro
4040153087 security: trim security.h
Trim security.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
2012-02-14 10:45:42 +11:00
Christopher Yeoh
8cdb878dcb Fix race in process_vm_rw_core
This fixes the race in process_vm_core found by Oleg (see

  http://article.gmane.org/gmane.linux.kernel/1235667/

for details).

This has been updated since I last sent it as the creation of the new
mm_access() function did almost exactly the same thing as parts of the
previous version of this patch did.

In order to use mm_access() even when /proc isn't enabled, we move it to
kernel/fork.c where other related process mm access functions already
are.

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-02 12:55:17 -08:00
Eric W. Biederman
4e75732035 sysctl: Don't call sysctl_follow_link unless we are a link.
There are no functional changes.  Just code motion to make it
clear that we don't follow a link between sysctl roots unless the
directory entry actually is a link.

Suggested-by:  Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01 19:21:38 -08:00
Eric W. Biederman
60f126d93b sysctl: Comments to make the code clearer.
Document get_subdir and that find_subdir alwasy takes a reference.

Suggested-by:  Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01 19:20:57 -08:00
Eric W. Biederman
0eb97f38d2 sysctl: Correct error return from get_subdir
When insert_header fails ensure we return the proper error value
from get_subdir.  In practice nothing cares, but there is no
need to be sloppy.

Reported-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01 19:20:40 -08:00
Eric W. Biederman
51f72f4a0f sysctl: An easier to read version of find_subdir
Suggested-by:  Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01 19:20:30 -08:00
Oleg Nesterov
6d08f2c713 proc: make sure mem_open() doesn't pin the target's memory
Once /proc/pid/mem is opened, the memory can't be released until
mem_release() even if its owner exits.

Change mem_open() to do atomic_inc(mm_count) + mmput(), this only
pins mm_struct. Change mem_rw() to do atomic_inc_not_zero(mm_count)
before access_remote_vm(), this verifies that this mm is still alive.

I am not sure what should mem_rw() return if atomic_inc_not_zero()
fails. With this patch it returns zero to match the "mm == NULL" case,
may be it should return -EINVAL like it did before e268337d.

Perhaps it makes sense to add the additional fatal_signal_pending()
check into the main loop, to ensure we do not hold this memory if
the target task was oom-killed.

Cc: stable@kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-01 14:39:01 -08:00
Oleg Nesterov
572d34b946 proc: unify mem_read() and mem_write()
No functional changes, cleanup and preparation.

mem_read() and mem_write() are very similar. Move this code into the
new common helper, mem_rw(), which takes the additional "int write"
argument.

Cc: stable@kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-01 14:39:01 -08:00
Oleg Nesterov
71879d3cb3 proc: mem_release() should check mm != NULL
mem_release() can hit mm == NULL, add the necessary check.

Cc: stable@kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-01 14:39:01 -08:00
Dan Carpenter
1347440db6 sysctl: fix memset parameters in setup_sysctl_set()
The current code is a nop.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-30 19:14:08 -08:00
Dan Carpenter
4798178709 sysctl: remove an unused variable
"links" is never used, so we can remove it.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-30 19:13:46 -08:00
Eric W. Biederman
fea478d410 sysctl: Add register_sysctl for normal sysctl users
The plan is to convert all callers of register_sysctl_table
and register_sysctl_paths to register_sysctl.  The interface
to register_sysctl is enough nicer this should make the callers
a bit more readable.  Additionally after the conversion the
230 lines of backwards compatibility can be removed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
ac13ac6f4c sysctl: Index sysctl directories with rbtrees.
One of the most important jobs of sysctl is to export network stack
tunables.  Several of those tunables are per network device.  In
several instances people are running with 1000+ network devices in
there network stacks, which makes the simple per directory linked list
in sysctl a scaling bottleneck.   Replace O(N^2) sysctl insertion and
lookup times with O(NlogN) by using an rbtree to index the sysctl
directories.

Benchmark before:
    make-dummies 0 999 -> 0.32s
    rmmod dummy        -> 0.12s
    make-dummies 0 9999 -> 1m17s
    rmmod dummy         -> 17s

Benchmark after:
    make-dummies 0 999 -> 0.074s
    rmmod dummy        -> 0.070s
    make-dummies 0 9999 -> 3.4s
    rmmod dummy         -> 0.44s

Benchmark after (without dev_snmp6):
    make-dummies 0 9999 -> 0.75s
    rmmod dummy         -> 0.44s
    make-dummies 0 99999 -> 11s
    rmmod dummy          -> 4.3s

At 10,000 dummy devices the bottleneck becomes the time to add and
remove the files under /proc/sys/net/dev_snmp6.  I have commented
out the code that adds and removes files under /proc/sys/net/dev_snmp6
and taken measurments of creating and destroying 100,000 dummies to
verify the sysctl continues to scale.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
9e3d47df35 sysctl: Make the header lists per directory.
Slightly enhance efficiency and clarity of the code by making the
header list per directory instead of per set.

Benchmark before:
    make-dummies 0 999 -> 0.63s
    rmmod dummy        -> 0.12s
    make-dummies 0 9999 -> 2m35s
    rmmod dummy         -> 18s

Benchmark after:
    make-dummies 0 999 -> 0.32s
    rmmod dummy        -> 0.12s
    make-dummies 0 9999 -> 1m17s
    rmmod dummy         -> 17s

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
e54012cede sysctl: Move sysctl_check_dups into insert_header
Simplify the callers of insert_header by removing explicit calls to check
for duplicates and instead have insert_header do the work.

This makes the code slightly more maintainable by enabling changes to
data structures where the insertion of new entries without duplicate
suppression is not possible.

There is not always a convenient path string where insert_header
is called so modify sysctl_check_dups to use sysctl_print_dir
when printing the full path when a duplicate is discovered.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
60a47a2e82 sysctl: Modify __register_sysctl_paths to take a set instead of a root and an nsproxy
An nsproxy argument here has always been awkard and now the nsproxy argument
is completely unnecessary so remove it, replacing it with the set we want
the registered tables to show up in.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
0e47c99d7f sysctl: Replace root_list with links between sysctl_table_sets.
Piecing together directories by looking first in one directory
tree, than in another directory tree and finally in a third
directory tree makes it hard to verify that some directory
entries are not multiply defined and makes it hard to create
efficient implementations the sysctl filesystem.

Replace the sysctl wide list of roots with autogenerated
links from the core sysctl directory tree to the other
sysctl directory trees.

This simplifies sysctl directory reading and lookups as now
only entries in a single sysctl directory tree need to be
considered.

Benchmark before:
    make-dummies 0 999 -> 0.44s
    rmmod dummy        -> 0.065s
    make-dummies 0 9999 -> 1m36s
    rmmod dummy         -> 0.4s

Benchmark after:
    make-dummies 0 999 -> 0.63s
    rmmod dummy        -> 0.12s
    make-dummies 0 9999 -> 2m35s
    rmmod dummy         -> 18s

The slowdown is caused by the lookups used in insert_headers
and put_links to see if we need to add links or remove links.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
6980128fe1 sysctl: Add sysctl_print_dir and use it in get_subdir
When there are errors it is very nice to know the full sysctl path.
Add a simple function that computes the sysctl path and prints it
out.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
7ec66d0636 sysctl: Stop requiring explicit management of sysctl directories
Simplify the code and the sysctl semantics by autogenerating
sysctl directories when a sysctl table is registered that needs
the directories and autodeleting the directories when there are
no more sysctl tables registered that need them.

Autogenerating directories keeps sysctl tables from depending
on each other, removing all of the arcane register/unregister
ordering constraints and makes it impossible to get the order
wrong when reigsering and unregistering sysctl tables.

Autogenerating directories yields one unique entity that dentries
can point to, retaining the current effective use of the dcache.

Add struct ctl_dir as the type of these new autogenerated
directories.

The attached_by and attached_to fields in ctl_table_header are
removed as they are no longer needed.

The child field in ctl_table is no longer needed by the core of
the sysctl code.  ctl_table.child can be removed once all of the
existing users have been updated.

Benchmark before:
    make-dummies 0 999 -> 0.7s
    rmmod dummy        -> 0.07s
    make-dummies 0 9999 -> 1m10s
    rmmod dummy         -> 0.4s

Benchmark after:
    make-dummies 0 999 -> 0.44s
    rmmod dummy        -> 0.065s
    make-dummies 0 9999 -> 1m36s
    rmmod dummy         -> 0.4s

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
9eb47c26f0 sysctl: Add a root pointer to ctl_table_set
Add a ctl_table_root pointer to ctl_table set so it is easy to
go from a ctl_table_set to a ctl_table_root.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
6a75ce167c sysctl: Rewrite proc_sys_readdir in terms of first_entry and next_entry
Replace sysctl_head_next with first_entry and next_entry.  These new
iterators operate at the level of sysctl table entries and filter
out any sysctl tables that should not be shown.

Utilizing two specialized functions instead of a single function removes
conditionals for handling awkward special cases that only come up
at the beginning of iteration, making the iterators easier to read
and understand.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
076c3eed2c sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry.
Replace the helpers that proc_sys_lookup uses with helpers that work
in terms of an entire sysctl directory.  This is worse for sysctl_lock
hold times but it is much better for code clarity and the code cleanups
to come.

find_in_table is no longer needed so it is removed.

find_entry a general helper to find entries in a directory is added.

lookup_entry is a simple wrapper around find_entry that takes the
sysctl_lock increases the use count if an entry is found and drops
the sysctl_lock.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
a194558e86 sysctl: Normalize the root_table data structure.
Every other directory has a .child member and we look at the .child
for our entries.  Do the same for the root_table.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
8425d6aaf0 sysctl: Factor out insert_header and erase_header
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
e0d045290a sysctl: Factor out init_header from __register_sysctl_paths
Factor out a routing to initialize the sysctl_table_header.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
938aaa4f92 sysctl: Initial support for auto-unregistering sysctl tables.
Add nreg to ctl_table_header.  When nreg drops to 0 the ctl_table_header
will be unregistered.

Factor out drop_sysctl_table from unregister_sysctl_table, and add
the logic for decrementing nreg.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
3cc3e04636 sysctl: A more obvious version of grab_header.
Instead of relying on sysct_head_next(NULL) to magically
return the right header for the root directory instead
explicitly transform NULL into the root directories header.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
8d6ecfcc01 sysctl: Remove the now unused ctl_table parent field.
While useful at one time for selinux and the sysctl sanity
checks those users no longer use the parent field and we can
safely remove it.

Inspired-by: Lucian Adrian Grijincu <lucian.grijincu@gmil.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
7c60c48f58 sysctl: Improve the sysctl sanity checks
- Stop validating subdirectories now that we only register leaf tables

- Cleanup and improve the duplicate filename check.
  * Run the duplicate filename check under the sysctl_lock to guarantee
    we never add duplicate names.
  * Reduce the duplicate filename check to nearly O(M*N) where M is the
    number of entries in tthe table we are registering and N is the
    number of entries in the directory before we got there.

- Move the duplicate filename check into it's own function and call
  it directtly from __register_sysctl_table

- Kill the config option as the sanity checks are now cheap enough
  the config option is unnecessary. The original reason for the config
  option was because we had a huge table used to verify the proc filename
  to binary sysctl mapping.  That table has now evolved into the binary_sysctl
  translation layer and is no longer part of the sysctl_check code.

- Tighten up the permission checks.  Guarnateeing that files only have read
  or write permissions.

- Removed redudant check for parents having a procname as now everything has
  a procname.

- Generalize the backtrace logic so that we print a backtrace from
  any failure of __register_sysctl_table that was not caused by
  a memmory allocation failure.  The backtrace allows us to track
  down who erroneously registered a sysctl table.

Bechmark before (CONFIG_SYSCTL_CHECK=y):
    make-dummies 0 999 -> 12s
    rmmod dummy        -> 0.08s

Bechmark before (CONFIG_SYSCTL_CHECK=n):
    make-dummies 0 999 -> 0.7s
    rmmod dummy        -> 0.06s
    make-dummies 0 99999 -> 1m13s
    rmmod dummy          -> 0.38s

Benchmark after:
    make-dummies 0 999 -> 0.65s
    rmmod dummy        -> 0.055s
    make-dummies 0 9999 -> 1m10s
    rmmod dummy         -> 0.39s

The sysctl sanity checks now impose no measurable cost.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:27 -08:00
Eric W. Biederman
f728019bb7 sysctl: register only tables of sysctl files
Split the registration of a complex ctl_table array which may have
arbitrary numbers of directories (->child != NULL) and tables of files
into a series of simpler registrations that only register tables of files.

Graphically:

   register('dir', { + file-a
                     + file-b
                     + subdir1
                       + file-c
                     + subdir2
                       + file-d
                       + file-e })

is transformed into:
   wrapper->subheaders[0] = register('dir', {file1-a, file1-b})
   wrapper->subheaders[1] = register('dir/subdir1', {file-c})
   wrapper->subheaders[2] = register('dir/subdir2', {file-d, file-e})
   return wrapper

This guarantees that __register_sysctl_table will only see a simple
ctl_table array with all entries having (->child == NULL).

Care was taken to pass the original simple ctl_table arrays to
__register_sysctl_table whenever possible.

This change is derived from a similar patch written
by Lucrian Grijincu.

Inspired-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:27 -08:00
Eric W. Biederman
ec6a52668d sysctl: Add ctl_table chains into cstring paths
For any component of table passed to __register_sysctl_paths
that actually serves as a path, add that to the cstring path
that is passed to __register_sysctl_table.

The result is that for most calls to __register_sysctl_paths
we only pass a table to __register_sysctl_table that contains
no child directories.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
6e9d516415 sysctl: Add support for register sysctl tables with a normal cstring path.
Make __register_sysctl_table the core sysctl registration operation and
make it take a char * string as path.

Now that binary paths have been banished into the real of backwards
compatibility in kernel/binary_sysctl.c where they can be safely
ignored there is no longer a need to use struct ctl_path to represent
path names when registering ctl_tables.

Start the transition to using normal char * strings to represent
pathnames when registering sysctl tables.  Normal strings are easier
to deal with both in the internal sysctl implementation and for
programmers registering sysctl tables.

__register_sysctl_paths is turned into a backwards compatibility wrapper
that converts a ctl_path array into a normal char * string.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
f05e53a7fb sysctl: Create local copies of directory names used in paths
Creating local copies of directory names is a good idea for
two reasons.
- The dynamic names used by callers must be copied into new
  strings by the callers today to ensure the strings do not
  change between register and unregister of the sysctl table.

- Sysctl directories have a potentially different lifetime
  than the time between register and unregister of any
  particular sysctl table.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
bd295b56cf sysctl: Remove the unnecessary sysctl_set parent concept.
In sysctl_net register the two networking roots in the proper order.

In register_sysctl walk the sysctl sets in the reverse order of the
sysctl roots.

Remove parent from ctl_table_set and setup_sysctl_set as it is no
longer needed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
97324cd804 sysctl: Implement retire_sysctl_set
This adds a small helper retire_sysctl_set to remove the intimate knowledge about
the how a sysctl_set is implemented from net/sysct_net.c

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
a15e20982e sysctl: Make the directories have nlink == 1
I goofed when I made sysctl directories have nlink == 0.
nlink == 0 means the directory has been deleted.
nlink == 1 meands a directory does not count subdirectories.

Use the default nlink == 1 for sysctl directories.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
1f87f0b52b sysctl: Move the implementation into fs/proc/proc_sysctl.c
Move the core sysctl code from kernel/sysctl.c and kernel/sysctl_check.c
into fs/proc/proc_sysctl.c.

Currently sysctl maintenance is hampered by the sysctl implementation
being split across 3 files with artificial layering between them.
Consolidate the entire sysctl implementation into 1 file so that
it is easier to see what is going on and hopefully allowing for
simpler maintenance.

For functions that are now only used in fs/proc/proc_sysctl.c remove
their declarations from sysctl.h and make them static in fs/proc/proc_sysctl.c

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Eric W. Biederman
de4e83bd6b sysctl: Register the base sysctl table like any other sysctl table.
Simplify the code by treating the base sysctl table like any other
sysctl table and register it with register_sysctl_table.

To ensure this table is registered early enough to avoid problems
call sysctl_init from proc_sys_init.

Rename sysctl_net.c:sysctl_init() to net_sysctl_init() to avoid
name conflicts now that kernel/sysctl.c:sysctl_init() is no longer
static.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Lucas De Marchi
36885d7b11 sysctl: remove impossible condition check
Remove checks for conditions that will never happen. If procname is NULL
the loop would already had bailed out, so there's no need to check it
again.

At the same time this also compacts the function find_in_table() by
refactoring it to be easier to read.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Will Deacon
85e72aa538 proc: clear_refs: do not clear reserved pages
/proc/pid/clear_refs is used to clear the Referenced and YOUNG bits for
pages and corresponding page table entries of the task with PID pid, which
includes any special mappings inserted into the page tables in order to
provide things like vDSOs and user helper functions.

On ARM this causes a problem because the vectors page is mapped as a
global mapping and since ec706dab ("ARM: add a vma entry for the user
accessible vector page"), a VMA is also inserted into each task for this
page to aid unwinding through signals and syscall restarts.  Since the
vectors page is required for handling faults, clearing the YOUNG bit (and
subsequently writing a faulting pte) means that we lose the vectors page
*globally* and cannot fault it back in.  This results in a system deadlock
on the next exception.

To see this problem in action, just run:

	$ echo 1 > /proc/self/clear_refs

on an ARM platform (as any user) and watch your system hang.  I think this
has been the case since 2.6.37

This patch avoids clearing the aforementioned bits for reserved pages,
therefore leaving the vectors page intact on ARM.  Since reserved pages
are not candidates for swap, this change should not have any impact on the
usefulness of clear_refs.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Reported-by: Moussa Ba <moussaba@micron.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Acked-by: Nicolas Pitre <nico@linaro.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: <stable@vger.kernel.org>		[2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23 08:38:48 -08:00
Linus Torvalds
567e47935a Merge branches 'sched-urgent-for-linus', 'perf-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/accounting, proc: Fix /proc/stat interrupts sum

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracepoints/module: Fix disabling tracepoints with taint CRAP or OOT
  x86/kprobes: Add arch/x86/tools/insn_sanity to .gitignore
  x86/kprobes: Fix typo transferred from Intel manual

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, syscall: Need __ARCH_WANT_SYS_IPC for 32 bits
  x86, tsc: Fix SMI induced variation in quick_pit_calibrate()
  x86, opcode: ANDN and Group 17 in x86-opcode-map.txt
  x86/kconfig: Move the ZONE_DMA entry under a menu
  x86/UV2: Add accounting for BAU strong nacks
  x86/UV2: Ack BAU interrupt earlier
  x86/UV2: Remove stale no-resources test for UV2 BAU
  x86/UV2: Work around BAU bug
  x86/UV2: Fix BAU destination timeout initialization
  x86/UV2: Fix new UV2 hardware by using native UV2 broadcast mode
  x86: Get rid of dubious one-bit signed bitfield
2012-01-19 14:53:06 -08:00
Linus Torvalds
f429ee3b80 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits)
  audit: no leading space in audit_log_d_path prefix
  audit: treat s_id as an untrusted string
  audit: fix signedness bug in audit_log_execve_info()
  audit: comparison on interprocess fields
  audit: implement all object interfield comparisons
  audit: allow interfield comparison between gid and ogid
  audit: complex interfield comparison helper
  audit: allow interfield comparison in audit rules
  Kernel: Audit Support For The ARM Platform
  audit: do not call audit_getname on error
  audit: only allow tasks to set their loginuid if it is -1
  audit: remove task argument to audit_set_loginuid
  audit: allow audit matching on inode gid
  audit: allow matching on obj_uid
  audit: remove audit_finish_fork as it can't be called
  audit: reject entry,always rules
  audit: inline audit_free to simplify the look of generic code
  audit: drop audit_set_macxattr as it doesn't do anything
  audit: inline checks for not needing to collect aux records
  audit: drop some potentially inadvisable likely notations
  ...

Use evil merge to fix up grammar mistakes in Kconfig file.

Bad speling and horrible grammar (and copious swearing) is to be
expected, but let's keep it to commit messages and comments, rather than
expose it to users in config help texts or printouts.
2012-01-17 16:41:31 -08:00
Linus Torvalds
e268337dfe proc: clean up and fix /proc/<pid>/mem handling
Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
robust, and it also doesn't match the permission checking of any of the
other related files.

This changes it to do the permission checks at open time, and instead of
tracking the process, it tracks the VM at the time of the open.  That
simplifies the code a lot, but does mean that if you hold the file
descriptor open over an execve(), you'll continue to read from the _old_
VM.

That is different from our previous behavior, but much simpler.  If
somebody actually finds a load where this matters, we'll need to revert
this commit.

I suspect that nobody will ever notice - because the process mapping
addresses will also have changed as part of the execve.  So you cannot
actually usefully access the fd across a VM change simply because all
the offsets for IO would have changed too.

Reported-by: Jüri Aedla <asd@ut.ee>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-17 15:21:19 -08:00
Eric Paris
633b454545 audit: only allow tasks to set their loginuid if it is -1
At the moment we allow tasks to set their loginuid if they have
CAP_AUDIT_CONTROL.  In reality we want tasks to set the loginuid when they
log in and it be impossible to ever reset.  We had to make it mutable even
after it was once set (with the CAP) because on update and admin might have
to restart sshd.  Now sshd would get his loginuid and the next user which
logged in using ssh would not be able to set his loginuid.

Systemd has changed how userspace works and allowed us to make the kernel
work the way it should.  With systemd users (even admins) are not supposed
to restart services directly.  The system will restart the service for
them.  Thus since systemd is going to loginuid==-1, sshd would get -1, and
sshd would be allowed to set a new loginuid without special permissions.

If an admin in this system were to manually start an sshd he is inserting
himself into the system chain of trust and thus, logically, it's his
loginuid that should be used!  Since we have old systems I make this a
Kconfig option.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:00 -05:00
Eric Paris
0a300be6d5 audit: remove task argument to audit_set_loginuid
The function always deals with current.  Don't expose an option
pretending one can use it for something.  You can't.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:00 -05:00
Russell King
f7e6746eba sched/accounting, proc: Fix /proc/stat interrupts sum
Commit 3292beb340 ("sched/accounting: Change cpustat fields to an array")
deleted the code which provides us with the sum of all interrupts in the
system, causing vmstat to report zero interrupts occuring in the system.

Fix this by restoring the code.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Tested-by: Russell King <rmk+kernel@arm.linux.org.uk> # [on ARM]
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Glauber Costa <glommer@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Tuner <pjt@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-16 08:13:27 +01:00
Linus Torvalds
c49c41a413 Merge branch 'for-linus' of git://selinuxproject.org/~jmorris/linux-security
* 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
  capabilities: remove __cap_full_set definition
  security: remove the security_netlink_recv hook as it is equivalent to capable()
  ptrace: do not audit capability check when outputing /proc/pid/stat
  capabilities: remove task_ns_* functions
  capabitlies: ns_capable can use the cap helpers rather than lsm call
  capabilities: style only - move capable below ns_capable
  capabilites: introduce new has_ns_capabilities_noaudit
  capabilities: call has_ns_capability from has_capability
  capabilities: remove all _real_ interfaces
  capabilities: introduce security_capable_noaudit
  capabilities: reverse arguments to security_capable
  capabilities: remove the task from capable LSM hook entirely
  selinux: sparse fix: fix several warnings in the security server cod
  selinux: sparse fix: fix warnings in netlink code
  selinux: sparse fix: eliminate warnings for selinuxfs
  selinux: sparse fix: declare selinux_disable() in security.h
  selinux: sparse fix: move selinux_complete_init
  selinux: sparse fix: make selinux_secmark_refcount static
  SELinux: Fix RCU deref check warning in sel_netport_insert()

Manually fix up a semantic mis-merge wrt security_netlink_recv():

 - the interface was removed in commit fd77846152 ("security: remove
   the security_netlink_recv hook as it is equivalent to capable()")

 - a new user of it appeared in commit a38f7907b9 ("crypto: Add
   userspace configuration API")

causing no automatic merge conflict, but Eric Paris pointed out the
issue.
2012-01-14 18:36:33 -08:00
Cyrill Gorcunov
b3f7f573a2 c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
The mm->start_code/end_code, mm->start_data/end_data, mm->start_brk are
involved into calculation of program text/data segment sizes (which might
be seen in /proc/<pid>/statm) and into brk() call final address.

For restore we need to know all these values.  While
mm->start_code/end_code already present in /proc/$pid/stat, the rest
members are not, so this patch brings them in.

The restore procedure of these members is addressed in another patch using
prctl().

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:13 -08:00
Xiaotian Feng
a2ef990ab5 proc: fix null pointer deref in proc_pid_permission()
get_proc_task() can fail to search the task and return NULL,
put_task_struct() will then bomb the kernel with following oops:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
  IP: [<ffffffff81217d34>] proc_pid_permission+0x64/0xe0
  PGD 112075067 PUD 112814067 PMD 0
  Oops: 0002 [#1] PREEMPT SMP

This is a regression introduced by commit 0499680a ("procfs: add hidepid=
and gid= mount options").  The kernel should return -ESRCH if
get_proc_task() failed.

Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Stephen Wilson <wilsons@start.ca>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:02 -08:00
Vasiliy Kulikov
0499680a42 procfs: add hidepid= and gid= mount options
Add support for mount options to restrict access to /proc/PID/
directories.  The default backward-compatible "relaxed" behaviour is left
untouched.

The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:

hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.

hidepid=1 means users may not access any /proc/<pid>/ directories, but
their own.  Sensitive files like cmdline, sched*, status are now protected
against other users.  As permission checking done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.

hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users.  It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g.  by kill -0 $PID), but it hides process' euid
and egid.  It compicates intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.

gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode).  This group should be used instead of putting
nonroot user in sudoers file or something.  However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.

hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystrokes
timings:

http://www.openwall.com/lists/oss-security/2011/11/05/3

hidepid=1/2 doesn't break monitoring userspace tools.  ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes.  pstree shows the process subtree which
contains "pstree" process.

Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368).  We rely on that the leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user started the setuid program may read the
counters.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 16:30:54 -08:00
Vasiliy Kulikov
97412950b1 procfs: parse mount options
Add support for procfs mount options.  Actual mount options are coming in
the next patches.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 16:30:54 -08:00
Pavel Emelyanov
640708a2cf procfs: introduce the /proc/<pid>/map_files/ directory
This one behaves similarly to the /proc/<pid>/fd/ one - it contains
symlinks one for each mapping with file, the name of a symlink is
"vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
results in a file that point exactly to the same inode as them vma's one.

For example the ls -l of some arbitrary /proc/<pid>/map_files/

 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

This *helps* checkpointing process in three ways:

1. When dumping a task mappings we do know exact file that is mapped
   by particular region.  We do this by opening
   /proc/$pid/map_files/$address symlink the way we do with file
   descriptors.

2. This also helps in determining which anonymous shared mappings are
   shared with each other by comparing the inodes of them.

3. When restoring a set of processes in case two of them has a mapping
   shared, we map the memory by the 1st one and then open its
   /proc/$pid/map_files/$address file and map it by the 2nd task.

Using /proc/$pid/maps for this is quite inconvenient since it brings
repeatable re-reading and reparsing for this text file which slows down
restore procedure significantly.  Also as being pointed in (3) it is a way
easier to use top level shared mapping in children as
/proc/$pid/map_files/$address when needed.

[akpm@linux-foundation.org: coding-style fixes]
[gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 16:30:54 -08:00
Cyrill Gorcunov
7773fbc541 procfs: make proc_get_link to use dentry instead of inode
Prepare the ground for the next "map_files" patch which needs a name of a
link file to analyse.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 16:30:54 -08:00
KAMEZAWA Hiroyuki
43d2b11324 tracepoint: add tracepoints for debugging oom_score_adj
oom_score_adj is used for guarding processes from OOM-Killer.  One of
problem is that it's inherited at fork().  When a daemon set oom_score_adj
and make children, it's hard to know where the value is set.

This patch adds some tracepoints useful for debugging. This patch adds
3 trace points.
  - creating new task
  - renaming a task (exec)
  - set oom_score_adj

To debug, users need to enable some trace pointer. Maybe filtering is useful as

# EVENT=/sys/kernel/debug/tracing/events/task/
# echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
# echo "oom_score_adj != 0" > $EVENT/task_rename/filter
# echo 1 > $EVENT/enable
# EVENT=/sys/kernel/debug/tracing/events/oom/
# echo 1 > $EVENT/enable

output will be like this.
# grep oom /sys/kernel/debug/tracing/trace
bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 16:30:44 -08:00
Linus Torvalds
972b2c7199 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
  reiserfs: Properly display mount options in /proc/mounts
  vfs: prevent remount read-only if pending removes
  vfs: count unlinked inodes
  vfs: protect remounting superblock read-only
  vfs: keep list of mounts for each superblock
  vfs: switch ->show_options() to struct dentry *
  vfs: switch ->show_path() to struct dentry *
  vfs: switch ->show_devname() to struct dentry *
  vfs: switch ->show_stats to struct dentry *
  switch security_path_chmod() to struct path *
  vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
  vfs: trim includes a bit
  switch mnt_namespace ->root to struct mount
  vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
  vfs: opencode mntget() mnt_set_mountpoint()
  vfs: spread struct mount - remaining argument of next_mnt()
  vfs: move fsnotify junk to struct mount
  vfs: move mnt_devname
  vfs: move mnt_list to struct mount
  vfs: switch pnode.h macros to struct mount *
  ...
2012-01-08 12:19:57 -08:00
Al Viro
ece2ccb668 Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into Z 2012-01-06 23:15:54 -05:00
Linus Torvalds
0db49b72bc Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  sched/tracing: Add a new tracepoint for sleeptime
  sched: Disable scheduler warnings during oopses
  sched: Fix cgroup movement of waking process
  sched: Fix cgroup movement of newly created process
  sched: Fix cgroup movement of forking process
  sched: Remove cfs bandwidth period check in tg_set_cfs_period()
  sched: Fix load-balance lock-breaking
  sched: Replace all_pinned with a generic flags field
  sched: Only queue remote wakeups when crossing cache boundaries
  sched: Add missing rcu_dereference() around ->real_parent usage
  [S390] fix cputime overflow in uptime_proc_show
  [S390] cputime: add sparse checking and cleanup
  sched: Mark parent and real_parent as __rcu
  sched, nohz: Fix missing RCU read lock
  sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer
  sched, nohz: Fix the idle cpu check in nohz_idle_balance
  sched: Use jump_labels for sched_feat
  sched/accounting: Fix parameter passing in task_group_account_field
  sched/accounting: Fix user/system tick double accounting
  sched/accounting: Re-use scheduler statistics for the root cgroup
  ...

Fix up conflicts in
 - arch/ia64/include/asm/cputime.h, include/asm-generic/cputime.h
	usecs_to_cputime64() vs the sparse cleanups
 - kernel/sched/fair.c, kernel/time/tick-sched.c
	scheduler changes in multiple branches
2012-01-06 08:44:54 -08:00
Eric Paris
69f594a389 ptrace: do not audit capability check when outputing /proc/pid/stat
Reading /proc/pid/stat of another process checks if one has ptrace permissions
on that process.  If one does have permissions it outputs some data about the
process which might have security and attack implications.  If the current
task does not have ptrace permissions the read still works, but those fields
are filled with inocuous (0) values.  Since this check and a subsequent denial
is not a violation of the security policy we should not audit such denials.

This can be quite useful to removing ptrace broadly across a system without
flooding the logs when ps is run or something which harmlessly walks proc.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
2012-01-05 18:53:00 -05:00
Al Viro
d10577a8d8 vfs: trim includes a bit
[folded fix for missing magic.h from Tetsuo Handa]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:13 -05:00
Al Viro
0226f4923f vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
rationale: that stuff is far tighter bound to fs/namespace.c than to
the guts of procfs proper.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:13 -05:00
Al Viro
d161a13f97 switch procfs to umode_t use
both proc_dir_entry ->mode and populating functions

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:56 -05:00
Al Viro
6b520e0565 vfs: fix the stupidity with i_dentry in inode destructors
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else.  Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:40 -05:00
Andreas Schwab
34845636a1 procfs: do not confuse jiffies with cputime64_t
Commit 2a95ea6c0d ("procfs: do not overflow get_{idle,iowait}_time
for nohz") did not take into account that one some architectures jiffies
and cputime use different units.

This causes get_idle_time() to return numbers in the wrong units, making
the idle time fields in /proc/stat wrong.

Instead of converting the usec value returned by
get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
usecs_to_cputime64 to convert it to the correct unit of cputime64_t.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-29 16:31:57 -08:00
Martin Schwidefsky
612ef28a04 Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into cputime-tip
Conflicts:
	drivers/cpufreq/cpufreq_conservative.c
	drivers/cpufreq/cpufreq_ondemand.c
	drivers/macintosh/rack-meter.c
	fs/proc/stat.c
	fs/proc/uptime.c
	kernel/sched/core.c
2011-12-19 19:23:15 +01:00
Martin Schwidefsky
c3e0ef9a29 [S390] fix cputime overflow in uptime_proc_show
For 32-bit architectures using standard jiffies the idletime calculation
in uptime_proc_show will quickly overflow. It takes (2^32 / HZ) seconds
of idle-time, or e.g. 12.45 days with no load on a quad-core with HZ=1000.
Switch to 64-bit calculations.

Cc: stable@vger.kernel.org
Cc: Michael Abbott <michael.abbott@diamond.ac.uk>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-12-15 14:56:19 +01:00
Martin Schwidefsky
648616343c [S390] cputime: add sparse checking and cleanup
Make cputime_t and cputime64_t nocast to enable sparse checking to
detect incorrect use of cputime. Drop the cputime macros for simple
scalar operations. The conversion macros are still needed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-12-15 14:56:19 +01:00
Ingo Molnar
6a54aebf69 Merge commit 'v3.2-rc5' into sched/core
Merge reason: Pick up the latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-15 08:21:30 +01:00
Linus Torvalds
ddb360778a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/ncpfs: fix error paths and goto statements in ncp_fill_super()
  configfs: register_filesystem() called too early
  fuse: register_filesystem() called too early
  ubifs: too early register_filesystem()
  ... and the same kind of leak for mqueue
  procfs: fix a vfsmount longterm reference leak
2011-12-14 18:22:55 -08:00
Michal Hocko
2a95ea6c0d procfs: do not overflow get_{idle,iowait}_time for nohz
Since commit a25cac5198 ("proc: Consider NO_HZ when printing idle and
iowait times") we are reporting idle/io_wait time also while a CPU is
tickless.  We rely on get_{idle,iowait}_time functions to retrieve
proper data.

These functions, however, use usecs_to_cputime to translate micro
seconds time to cputime64_t.  This is just an alias to usecs_to_jiffies
which reduces the data type from u64 to unsigned int and also checks
whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET)
and returns MAX_JIFFY_OFFSET in that case.

When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300
it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for
>3000s! until we overflow unsigned int.  Just for reference
CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and
CONFIG_HZ_1000 ~2s.

This results in a bug when people saw [h]top going mad reporting 100%
CPU usage even though there was basically no CPU load.  The reason was
simply that /proc/stat stopped reporting idle/io_wait changes (and
reported MAX_JIFFY_OFFSET) and so the only change happening was for user
system time.

Let's use nsecs_to_jiffies64 instead which doesn't reduce the precision
to 32b type and it is much more appropriate for cumulative time values
(unlike usecs_to_jiffies which intended for timeout calculations).

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Artem S. Tashkinov <t.artem@mailcity.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-09 07:50:29 -08:00
Claudio Scordino
b53fc7c297 fs/proc/meminfo.c: fix compilation error
Fix the error message "directives may not be used inside a macro argument"
which appears when the kernel is compiled for the cris architecture.

Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-09 07:50:28 -08:00
Al Viro
905ad269c5 procfs: fix a vfsmount longterm reference leak
kern_mount() doesn't pair with plain mntput()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-09 00:40:19 -05:00
Glauber Costa
3292beb340 sched/accounting: Change cpustat fields to an array
This patch changes fields in cpustat from a structure, to an
u64 array. Math gets easier, and the code is more flexible.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Tuner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1322498719-2255-2-git-send-email-glommer@parallels.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06 09:06:38 +01:00
Linus Torvalds
5e442a493f Revert "proc: fix races against execve() of /proc/PID/fd**"
This reverts commit aa6afca5bc.

It escalates of some of the google-chrome SELinux problems with ptrace
("Check failed: pid_ > 0.  Did not find zygote process"), and Andrew
says that it is also causing mystery lockdep reports.

Reported-by: Alex Villacís Lasso <a_villacis@palosanto.com>
Requested-by: James Morris <jmorris@namei.org>
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-09 18:16:00 -05:00
Linus Torvalds
32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds
092f4c56c1 Merge branch 'akpm' (Andrew's incoming - part two)
Says Andrew:

 "60 patches.  That's good enough for -rc1 I guess.  I have quite a lot
  of detritus to be rechecked, work through maintainers, etc.

 - most of the remains of MM
 - rtc
 - various misc
 - cgroups
 - memcg
 - cpusets
 - procfs
 - ipc
 - rapidio
 - sysctl
 - pps
 - w1
 - drivers/misc
 - aio"

* akpm: (60 commits)
  memcg: replace ss->id_lock with a rwlock
  aio: allocate kiocbs in batches
  drivers/misc/vmw_balloon.c: fix typo in code comment
  drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
  w1: disable irqs in critical section
  drivers/w1/w1_int.c: multiple masters used same init_name
  drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
  drivers/power/ds2780_battery.c: add a nolock function to w1 interface
  drivers/power/ds2780_battery.c: create central point for calling w1 interface
  w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
  pps gpio client: add missing dependency
  pps: new client driver using GPIO
  pps: default echo function
  include/linux/dma-mapping.h: add dma_zalloc_coherent()
  sysctl: make CONFIG_SYSCTL_SYSCALL default to n
  sysctl: add support for poll()
  RapidIO: documentation update
  drivers/net/rionet.c: fix ethernet address macros for LE platforms
  RapidIO: fix potential null deref in rio_setup_device()
  RapidIO: add mport driver for Tsi721 bridge
  ...
2011-11-02 16:07:27 -07:00
Lucas De Marchi
f1ecf06854 sysctl: add support for poll()
Adding support for poll() in sysctl fs allows userspace to receive
notifications of changes in sysctl entries.  This adds a infrastructure to
allow files in sysctl fs to be pollable and implements it for hostname and
domainname.

[akpm@linux-foundation.org: s/declare/define/ for definitions]
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Greg KH <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:07:02 -07:00
Vasiliy Kulikov
aa6afca5bc proc: fix races against execve() of /proc/PID/fd**
fd* files are restricted to the task's owner, and other users may not get
direct access to them.  But one may open any of these files and run any
setuid program, keeping opened file descriptors.  As there are permission
checks on open(), but not on readdir() and read(), operations on the kept
file descriptors will not be checked.  It makes it possible to violate
procfs permission model.

Reading fdinfo/* may disclosure current fds' position and flags, reading
directory contents of fdinfo/ and fd/ may disclosure the number of opened
files by the target task.  This information is not sensible per se, but it
can reveal some private information (like length of a password stored in a
file) under certain conditions.

Used existing (un)lock_trace functions to check for ptrace_may_access(),
but instead of using EPERM return code from it use EACCES to be consistent
with existing proc_pid_follow_link()/proc_pid_readlink() return code.  If
they differ, attacker can guess what fds exist by analyzing stat() return
code.  Patched handlers: stat() for fd/*, stat() and read() for fdindo/*,
readdir() and lookup() for fd/ and fdinfo/.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:07:00 -07:00
Pavel Emelyanov
887df07891 procfs: report EISDIR when reading sysctl dirs in proc
On reading sysctl dirs we should return -EISDIR instead of -EINVAL.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:07:00 -07:00
Miklos Szeredi
bfe8684869 filesystems: add set_nlink()
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:43 +01:00
Miklos Szeredi
6d6b77f163 filesystems: add missing nlink wrappers
Replace direct i_nlink updates with the respective updater function
(inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-11-02 12:53:43 +01:00
Christoph Lameter
bc3e53f682 mm: distinguish between mlocked and pinned pages
Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".

The difference between mlocking and pinning is:

A. mlocked pages are marked with PG_mlocked and are exempt from
   swapping. Page migration may move them around though.
   They are kept on a special LRU list.

B. Pinned pages cannot be moved because something needs to
   directly access physical memory. They may not be on any
   LRU list.

I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:

Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.

This patch introduces a separate counter for pinned pages and
accounts them seperately.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:46 -07:00
David Rientjes
c9f01245b6 oom: remove oom_disable_count
This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy.  The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.

The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing.  The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:

 - it is possible that the OOM_DISABLE thread sharing memory with the
   victim is waiting on that thread to exit and will actually cause
   future memory freeing, and

 - there is no guarantee that a thread is disabled from oom killing just
   because another thread sharing its mm is oom disabled.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:45 -07:00
Andrew Morton
fc360bd9cd /proc/self/numa_maps: restore "huge" tag for hugetlb vmas
The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm:
use walk_page_range() instead of custom page table walking code").

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:44 -07:00
Paul Gortmaker
afeacc8c1f fs: add export.h to files using EXPORT_SYMBOL/THIS_MODULE macros
These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason.  Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:31 -04:00
Linus Torvalds
39adff5f69 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  time, s390: Get rid of compile warning
  dw_apb_timer: constify clocksource name
  time: Cleanup old CONFIG_GENERIC_TIME references that snuck in
  time: Change jiffies_to_clock_t() argument type to unsigned long
  alarmtimers: Fix error handling
  clocksource: Make watchdog reset lockless
  posix-cpu-timers: Cure SMP accounting oddities
  s390: Use direct ktime path for s390 clockevent device
  clockevents: Add direct ktime programming function
  clockevents: Make minimum delay adjustments configurable
  nohz: Remove "Switched to NOHz mode" debugging messages
  proc: Consider NO_HZ when printing idle and iowait times
  nohz: Make idle/iowait counter update conditional
  nohz: Fix update_ts_time_stat idle accounting
  cputime: Clean up cputime_to_usecs and usecs_to_cputime macros
  alarmtimers: Rework RTC device selection using class interface
  alarmtimers: Add try_to_cancel functionality
  alarmtimers: Add more refined alarm state tracking
  alarmtimers: Remove period from alarm structure
  alarmtimers: Remove interval cap limit hack
  ...
2011-10-26 17:15:03 +02:00
Dave Hansen
32ef43848f teach /proc/$pid/numa_maps about transparent hugepages
This is modeled after the smaps code.

It detects transparent hugepages and then does a single gather_stats()
for the page as a whole.  This has two benifits:
 1. It is more efficient since it does many pages in a single shot.
 2. It does not have to break down the huge page.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-21 13:15:44 -07:00
Dave Hansen
3200a8aaab break out numa_maps gather_pte_stats() checks
gather_pte_stats() does a number of checks on a target page
to see whether it should even be considered for statistics.
This breaks that code out in to a separate function so that
we can use it in the transparent hugepage case in the next
patch.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-21 13:15:44 -07:00
Dave Hansen
eb4866d006 make /proc/$pid/numa_maps gather_stats() take variable page size
We need to teach the numa_maps code about transparent huge pages.  The
first step is to teach gather_stats() that the pte it is dealing with
might represent more than one page.

Note that will we use this in a moment for transparent huge pages since
they have use a single pmd_t which _acts_ as a "surrogate" for a bunch
of smaller pte_t's.

I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes
for hugetlbfs pages and PAGE_SIZE for normal pages.  That means that to
figure out how many _bytes_ "dirty=1" means, you must first know the
hugetlbfs page size.  That's easier said than done especially if you
don't have visibility in to the mount.

But, that's probably a discussion for another day especially since it
would change behavior to fix it.  But, just in case anyone wonders why
this patch only passes a '1' in the hugetlb case...

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-21 13:15:44 -07:00
Michal Hocko
a25cac5198 proc: Consider NO_HZ when printing idle and iowait times
show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
statistics when priting information about idle and iowait times.
This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because
counters are updated periodically.
With NO_HZ things got more tricky because we are not doing idle/iowait
accounting while we are tickless so the value might get outdated.
Users of /proc/stat will notice that by unchanged idle/iowait values
which is then interpreted as 0% idle/iowait time. From the user space
POV this is an unexpected behavior and a change of the interface.

Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
total idle/iowait time since boot and it doesn't rely on sampling or any
other periodic activity. Fall back to the previous behavior if NO_HZ is
disabled or not configured.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Jones <davej@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08 11:10:55 +02:00
Linus Torvalds
1117f72ea0 vfs: show O_CLOEXE bit properly in /proc/<pid>/fdinfo/<fd> files
The CLOEXE bit is magical, and for performance (and semantic) reasons we
don't actually maintain it in the file descriptor itself, but in a
separate bit array.  Which means that when we show f_flags, the CLOEXE
status is shown incorrectly: we show the status not as it is now, but as
it was when the file was opened.

Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
array.

Uli needs this in order to re-implement the pfiles program:

  "For normal file descriptors (not sockets) this was the last piece of
   information which wasn't available.  This is all part of my 'give
   Solaris users no reason to not switch' effort.  I intend to offer the
   code to the util-linux-ng maintainers."

Requested-by: Ulrich Drepper <drepper@akkadia.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-06 11:51:33 -07:00
Linus Torvalds
c21427043d oom_ajd: don't use WARN_ONCE, just use printk_once
WARN_ONCE() is very annoying, in that it shows the stack trace that we
don't care about at all, and also triggers various user-level "kernel
oopsed" logic that we really don't care about.  And it's not like the
user can do anything about the applications (sshd) in question, it's a
distro issue.

Requested-by: Andi Kleen <andi@firstfloor.org> (and many others)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-06 11:43:08 -07:00
David Howells
09570f9149 proc: make struct proc_dir_entry::name a terminal array rather than a pointer
Since __proc_create() appends the name it is given to the end of the PDE
structure that it allocates, there isn't a need to store a name pointer.
Instead we can just replace the name pointer with a terminal char array of
_unspecified_ length.  The compiler will simply append the string to statically
defined variables of PDE type overlapping any hole at the end of the structure
and, unlike specifying an explicitly _zero_ length array, won't give a warning
if you try to statically initialise it with a string of more than zero length.

Also, whilst we're at it:

 (1) Move namelen to end just prior to name and reduce it to a single byte
     (name shouldn't be longer than NAME_MAX).

 (2) Move pde_unload_lock two places further on so that if it's four bytes in
     size on a 64-bit machine, it won't cause an unused hole in the PDE struct.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-27 12:50:45 -07:00
Arun Sharma
60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
Vasiliy Kulikov
293eb1e777 proc: fix a race in do_io_accounting()
If an inode's mode permits opening /proc/PID/io and the resulting file
descriptor is kept across execve() of a setuid or similar binary, the
ptrace_may_access() check tries to prevent using this fd against the
task with escalated privileges.

Unfortunately, there is a race in the check against execve().  If
execve() is processed after the ptrace check, but before the actual io
information gathering, io statistics will be gathered from the
privileged process.  At least in theory this might lead to gathering
sensible information (like ssh/ftp password length) that wouldn't be
available otherwise.

Holding task->signal->cred_guard_mutex while gathering the io
information should protect against the race.

The order of locking is similar to the one inside of ptrace_attach():
first goes cred_guard_mutex, then lock_task_sighand().

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:43 -07:00
Daisuke Ogino
d2857e79a2 procfs: return ENOENT on opening a being-removed proc entry
Change the return value to ENOENT.  This return value is then returned
when opening the proc entry that have been removed.  For example,
open("/proc/bus/pci/XX/YY") when the corresponding device is being
hot-removed.

Signed-off-by: Daisuke Ogino <ogino.daisuke@jp.fujitsu.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:43 -07:00
David Rientjes
be8f684d73 oom: make deprecated use of oom_adj more verbose
/proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
according to Documentation/feature-removal-schedule.txt.

This patch makes the warning more verbose by making it appear as a more
serious problem (the presence of a stack trace and being multiline should
attract more attention) so that applications still using the old interface
can get fixed.

Very popular users of the old interface have been converted since the oom
killer rewrite has been introduced.  udevd switched to the
/proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
opensshd switched in 5.7p1.

At the start of 2012, this should be changed into a WARN() to emit all
such incidents and then finally remove the tunable in August 2012 as
scheduled.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25 20:57:09 -07:00
Linus Torvalds
bbd9d6f7fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
  vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
  isofs: Remove global fs lock
  jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
  fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
  mm/truncate.c: fix build for CONFIG_BLOCK not enabled
  fs:update the NOTE of the file_operations structure
  Remove dead code in dget_parent()
  AFS: Fix silly characters in a comment
  switch d_add_ci() to d_splice_alias() in "found negative" case as well
  simplify gfs2_lookup()
  jfs_lookup(): don't bother with . or ..
  get rid of useless dget_parent() in btrfs rename() and link()
  get rid of useless dget_parent() in fs/btrfs/ioctl.c
  fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
  drivers: fix up various ->llseek() implementations
  fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
  Ext4: handle SEEK_HOLE/SEEK_DATA generically
  Btrfs: implement our own ->llseek
  fs: add SEEK_HOLE and SEEK_DATA flags
  reiserfs: make reiserfs default to barrier=flush
  ...

Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
2011-07-22 19:02:39 -07:00
Linus Torvalds
8209f53d79 Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
  ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
  ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
  connector: add an event for monitoring process tracers
  ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
  ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
  ptrace_init_task: initialize child->jobctl explicitly
  has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
  ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
  ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
  ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
  ptrace: ptrace_reparented() should check same_thread_group()
  redefine thread_group_leader() as exit_signal >= 0
  do not change dead_task->exit_signal
  kill task_detached()
  reparent_leader: check EXIT_DEAD instead of task_detached()
  make do_notify_parent() __must_check, update the callers
  __ptrace_detach: avoid task_detached(), check do_notify_parent()
  kill tracehook_notify_death()
  make do_notify_parent() return bool
  ptrace: s/tracehook_tracer_task()/ptrace_parent()/
  ...
2011-07-22 15:06:50 -07:00
Kay Sievers
f15146380d fs: seq_file - add event counter to simplify poll() support
Moving the event counter into the dynamically allocated 'struc seq_file'
allows poll() support without the need to allocate its own tracking
structure.

All current users are switched over to use the new counter.

Requested-by: Andrew Morton akpm@linux-foundation.org
Acked-by: NeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:50 -04:00
Al Viro
10556cb21a ->permission() sanitizing: don't pass flags to ->permission()
not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:24 -04:00
Al Viro
2830ba7f34 ->permission() sanitizing: don't pass flags to generic_permission()
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:22 -04:00
Al Viro
1fc0f78ca9 ->permission() sanitizing: MAY_NOT_BLOCK
Duplicate the flags argument into mask bitmap.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:18 -04:00
Al Viro
178ea73521 kill check_acl callback of generic_permission()
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:16 -04:00
Vasiliy Kulikov
1d1221f375 proc: restrict access to /proc/PID/io
/proc/PID/io may be used for gathering private information.  E.g.  for
openssh and vsftpd daemons wchars/rchars may be used to learn the
precise password length.  Restrict it to processes being able to ptrace
the target process.

ptrace_may_access() is needed to prevent keeping open file descriptor of
"io" file, executing setuid binary and gathering io information of the
setuid'ed process.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-28 09:39:11 -07:00
Tejun Heo
06d984737b ptrace: s/tracehook_tracer_task()/ptrace_parent()/
tracehook.h is on the way out.  Rename tracehook_tracer_task() to
ptrace_parent() and move it from tracehook.h to ptrace.h.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2011-06-22 19:26:29 +02:00
Al Viro
1aec7036d0 proc_sys_permission() is OK in RCU mode
nothing blocking there, since all instances of sysctl
->permissions() method are non-blocking - both of them,
that is.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-20 10:45:25 -04:00
Al Viro
cf12791116 proc_fd_permission() is doesn't need to bail out in RCU mode
nothing blocking except generic_permission()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-20 10:44:50 -04:00
Linus Torvalds
8b97b21e0f Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
  proc: Fix Oops on stat of /proc/<zombie pid>/ns/net
2011-06-16 15:02:20 -07:00
Eric W. Biederman
793925334f proc: Fix Oops on stat of /proc/<zombie pid>/ns/net
Don't call iput with the inode half setup to be a namespace filedescriptor.
Instead rearrange the code so that we don't initialize ei->ns_ops until
after I ns_ops->get succeeds, preventing us from invoking ns_ops->put
when ns_ops->get failed.

Reported-by: Ingo Saitz <Ingo.Saitz@stud.uni-hannover.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-06-15 14:35:29 -07:00
Al Viro
ff78fca2a0 fix leak in proc_set_super()
set_anon_super() can fail...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-12 17:45:28 -04:00
Linus Torvalds
57ed609d4b Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  arch/tile: more /proc and /sys file support
2011-05-29 11:29:28 -07:00
Chris Metcalf
f133ecca9c arch/tile: more /proc and /sys file support
This change introduces a few of the less controversial /proc and
/proc/sys interfaces for tile, along with sysfs attributes for
various things that were originally proposed as /proc/tile files.
It also adjusts the "hardwall" proc API.

Arnd Bergmann reviewed the initial arch/tile submission, which
included a complete set of all the /proc/tile and /proc/sys/tile
knobs that we had added in a somewhat ad hoc way during initial
development, and provided feedback on where most of them should go.

One knob turned out to be similar enough to the existing
/proc/sys/debug/exception-trace that it was re-implemented to use
that model instead.

Another knob was /proc/tile/grid, which reported the "grid" dimensions
of a tile chip (e.g. 8x8 processors = 64-core chip).  Arnd suggested
looking at sysfs for that, so this change moves that information
to a pair of sysfs attributes (chip_width and chip_height) in the
/sys/devices/system/cpu directory.  We also put the "chip_serial"
and "chip_revision" information from our old /proc/tile/board file
as attributes in /sys/devices/system/cpu.

Other information collected via hypervisor APIs is now placed in
/sys/hypervisor.  We create a /sys/hypervisor/type file (holding the
constant string "tilera") to be parallel with the Xen use of
/sys/hypervisor/type holding "xen".  We create three top-level files,
"version" (the hypervisor's own version), "config_version" (the
version of the configuration file), and "hvconfig" (the contents of
the configuration file).  The remaining information from our old
/proc/tile/board and /proc/tile/switch files becomes an attribute
group appearing under /sys/hypervisor/board/.

Finally, after some feedback from Arnd Bergmann for the previous
version of this patch, the /proc/tile/hardwall file is split up into
two conceptual parts.  First, a directory /proc/tile/hardwall/ which
contains one file per active hardwall, each file named after the
hardwall's ID and holding a cpulist that says which cpus are enclosed by
the hardwall.  Second, a /proc/PID file "hardwall" that is either
empty (for non-hardwall-using processes) or contains the hardwall ID.

Finally, this change pushes the /proc/sys/tile/unaligned_fixup/
directory, with knobs controlling the kernel code for handling the
fixup of unaligned exceptions.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
2011-05-27 10:39:05 -04:00
Olaf Hering
997c136f51 fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages
The balloon driver in a Xen guest frees guest pages and marks them as
mmio.  When the kernel crashes and the crash kernel attempts to read the
oldmem via /proc/vmcore a read from ballooned pages will generate 100%
load in dom0 because Xen asks qemu-dm for the page content.  Since the
reads come in as 8byte requests each ballooned page is tried 512 times.

With this change a hook can be registered which checks wether the given
pfn is really ram.  The hook has to return a value > 0 for ram pages, a
value < 0 on error (because the hypercall is not known) and 0 for non-ram
pages.

This will reduce the time to read /proc/vmcore.  Without this change a
512M guest with 128M crashkernel region needs 200 seconds to read it, with
this change it takes just 2 seconds.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
KOSAKI Motohiro
98bc93e505 proc: fix pagemap_read() error case
Currently, pagemap_read() has three error and/or corner case handling
mistake.

 (1) If ppos parameter is wrong, mm refcount will be leak.
 (2) If count parameter is 0, mm refcount will be leak too.
 (3) If the current task is sleeping in kmalloc() and the system
     is out of memory and oom-killer kill the proc associated task,
     mm_refcount prevent the task free its memory. then system may
     hang up.

<Quote Hugh's explain why we shold call kmalloc() before get_mm()>

  check_mem_permission gets a reference to the mm.  If we
  __get_free_page after check_mem_permission, imagine what happens if the
  system is out of memory, and the mm we're looking at is selected for
  killing by the OOM killer: while we wait in __get_free_page for more
  memory, no memory is freed from the selected mm because it cannot reach
  exit_mmap while we hold that reference.

This patch fixes the above three.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Stephen Wilson <wilsons@start.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
KOSAKI Motohiro
30cd890391 proc: put check_mem_permission after __get_free_page in mem_write
It whould be better if put check_mem_permission after __get_free_page in
mem_write, to be same as function mem_read.

Hugh Dickins explained the reason.

    check_mem_permission gets a reference to the mm.  If we __get_free_page
    after check_mem_permission, imagine what happens if the system is out
    of memory, and the mm we're looking at is selected for killing by the
    OOM killer: while we wait in __get_free_page for more memory, no memory
    is freed from the selected mm because it cannot reach exit_mmap while
    we hold that reference.

Reported-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
Yuanhan Liu
a4dbf0ec2a proc/stat: use defined macro KMALLOC_MAX_SIZE
There is a macro for the max size kmalloc can allocate, so use it instead
of a hardcoded number.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
Mike Frysinger
e130aa70f4 proc: constify status array
No need for this local array to be writable, so mark it const.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
Alexey Dobriyan
0a8cb8e341 fs/proc: convert to kstrtoX()
Convert fs/proc/ from strict_strto*() to kstrto*() functions.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
Jiri Slaby
3864601387 mm: extract exe_file handling from procfs
Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
This was because exe_file was needed only for /proc/<pid>/exe.  Since we
will need the exe_file functionality also for core dumps (so core name can
contain full binary path), built this functionality always into the
kernel.

To achieve that move that out of proc FS to the kernel/ where in fact it
should belong.  By doing that we can make dup_mm_exe_file static.  Also we
can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
KOSAKI Motohiro
ca16d140af mm: don't access vm_flags as 'int'
The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
'unsigned int'. This patch fixes such misuse.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
[ Changed to use a typedef - we'll extend it to cover more cases
  later, since there has been discussion about making it a 64-bit
  type..                      - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 09:20:31 -07:00
Linus Torvalds
14d74e0cab Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
  net: fix get_net_ns_by_fd for !CONFIG_NET_NS
  ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
  ns: Declare sys_setns in syscalls.h
  net: Allow setting the network namespace by fd
  ns proc: Add support for the ipc namespace
  ns proc: Add support for the uts namespace
  ns proc: Add support for the network namespace.
  ns: Introduce the setns syscall
  ns: proc files for namespace naming policy.
2011-05-25 18:10:16 -07:00
Linus Torvalds
3f5785ec31 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (89 commits)
  bonding: documentation and code cleanup for resend_igmp
  bonding: prevent deadlock on slave store with alb mode (v3)
  net: hold rtnl again in dump callbacks
  Add Fujitsu 1000base-SX PCI ID to tg3
  bnx2x: protect sequence increment with mutex
  sch_sfq: fix peek() implementation
  isdn: netjet - blacklist Digium TDM400P
  via-velocity: don't annotate MAC registers as packed
  xen: netfront: hold RTNL when updating features.
  sctp: fix memory leak of the ASCONF queue when free asoc
  net: make dev_disable_lro use physical device if passed a vlan dev (v2)
  net: move is_vlan_dev into public header file (v2)
  bug.h: Fix build with CONFIG_PRINTK disabled.
  wireless: fix fatal kernel-doc error + warning in mac80211.h
  wireless: fix cfg80211.h new kernel-doc warnings
  iwlagn: dbg_fixed_rate only used when CONFIG_MAC80211_DEBUGFS enabled
  dst: catch uninitialized metrics
  be2net: hash key for rss-config cmd not set
  bridge: initialize fake_rtable metrics
  net: fix __dst_destroy_metrics_generic()
  ...

Fix up trivial conflicts in drivers/staging/brcm80211/brcmfmac/wl_cfg80211.c
2011-05-25 17:00:17 -07:00
Stephen Wilson
5b52fc890b proc: allocate storage for numa_maps statistics once
In show_numa_map() we collect statistics into a numa_maps structure.
Since the number of NUMA nodes can be very large, this structure is not a
candidate for stack allocation.

Instead of going thru a kmalloc()+kfree() cycle each time show_numa_map()
is invoked, perform the allocation just once when /proc/pid/numa_maps is
opened.

Performing the allocation when numa_maps is opened, and thus before a
reference to the target tasks mm is taken, eliminates a potential
stalemate condition in the oom-killer as originally described by Hugh
Dickins:

  ... imagine what happens if the system is out of memory, and the mm
  we're looking at is selected for killing by the OOM killer: while
  we wait in __get_free_page for more memory, no memory is freed
  from the selected mm because it cannot reach exit_mmap while we hold
  that reference.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:35 -07:00
Stephen Wilson
f2beb79836 proc: make struct proc_maps_private truly private
Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps
there is no need to export struct proc_maps_private to the world.  Move it
to fs/proc/internal.h instead.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:35 -07:00
Stephen Wilson
f69ff943df mm: proc: move show_numa_map() to fs/proc/task_mmu.c
Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
issues.

  - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.

  - The need to export ad hoc info like struct proc_maps_private is
    eliminated.

  - The implementation of show_numa_map() can be improved in a simple
    manner by cooperating with the other seq_file operations (start,
    stop, etc) -- something that would be messy to do without this
    change.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:34 -07:00
Eric W. Biederman
62ca24baf1 ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
Spotted-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-05-24 15:30:33 -07:00
John W. Linville
31ec97d9ce Merge ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem 2011-05-24 16:47:54 -04:00
Alexey Dobriyan
011159a0a7 airo: correct proc entry creation interfaces
* use proc_mkdir_mode() instead of create_proc_entry(S_IFDIR|...),
  export proc_mkdir_mode() for that, oh well.
* don't supply S_IFREG to proc_create_data(), it's unnecessary

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-05-16 14:25:28 -04:00
Eric W. Biederman
a00eaf11a2 ns proc: Add support for the ipc namespace
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-05-10 14:35:47 -07:00
Eric W. Biederman
34482e89a5 ns proc: Add support for the uts namespace
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-05-10 14:35:35 -07:00
Eric W. Biederman
13b6f57623 ns proc: Add support for the network namespace.
Implementing file descriptors for the network namespace
is simple and straight forward.

Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-05-10 14:34:26 -07:00
Eric W. Biederman
6b4e306aa3 ns: proc files for namespace naming policy.
Create files under /proc/<pid>/ns/ to allow controlling the
namespaces of a process.

This addresses three specific problems that can make namespaces hard to
work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the child
  of the original creator.
- Namespaces don't have names that userspace can use to talk about
  them.

The namespace files under /proc/<pid>/ns/ can be opened and the
file descriptor can be used to talk about a specific namespace, and
to keep the specified namespace alive.

A namespace can be kept alive by either holding the file descriptor
open or bind mounting the file someplace else.  aka:
mount --bind /proc/self/ns/net /some/filesystem/path
mount --bind /proc/self/fd/<N> /some/filesystem/path

This allows namespaces to be named with userspace policy.

It requires additional support to make use of these filedescriptors
and that will be comming in the following patches.

Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-05-10 14:31:44 -07:00
Mikulas Patocka
a09a79f668 Don't lock guardpage if the stack is growing up
Linux kernel excludes guard page when performing mlock on a VMA with
down-growing stack. However, some architectures have up-growing stack
and locking the guard page should be excluded in this case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
up-growing stack). lvm2 calculates number of used pages when locking and
when unlocking and reports an internal error if the numbers mismatch.

[ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
  grows-up case, and to move things around a bit to clean it all up and
  share the infrstructure with the /proc bits.

  Tested on ia64 that has both grow-up and grow-down segments  - Linus ]

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 16:22:07 -07:00
Linus Torvalds
d8bdc59f21 proc: do proper range check on readdir offset
Rather than pass in some random truncated offset to the pid-related
functions, check that the offset is in range up-front.

This is just cleanup, the previous commit fixed the real problem.

Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-18 10:36:54 -07:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds
76597cd314 proc: fix oops on invalid /proc/<pid>/maps access
When m_start returns an error, the seq_file logic will still call m_stop
with that error entry, so we'd better make sure that we check it before
using it as a vma.

Introduced by commit ec6fd8a435 ("report errors in /proc/*/*map*
sanely"), which replaced NULL with various ERR_PTR() cases.

(On ia64, you happen to get a unaligned fault instead of a page fault,
since the address used is generally some random error code like -EPERM)

Reported-by: Anca Emanuel <anca.emanuel@gmail.com>
Reported-by: Tony Luck <tony.luck@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-27 19:09:29 -07:00
Linus Torvalds
b81a618dcd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  deal with races in /proc/*/{syscall,stack,personality}
  proc: enable writing to /proc/pid/mem
  proc: make check_mem_permission() return an mm_struct on success
  proc: hold cred_guard_mutex in check_mem_permission()
  proc: disable mem_write after exec
  mm: implement access_remote_vm
  mm: factor out main logic of access_process_vm
  mm: use mm_struct to resolve gate vma's in __get_user_pages
  mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
  mm: arch: make in_gate_area take an mm_struct instead of a task_struct
  mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
  x86: mark associated mm when running a task in 32 bit compatibility mode
  x86: add context tag to mark mm when running a task in 32-bit compatibility mode
  auxv: require the target to be tracable (or yourself)
  close race in /proc/*/environ
  report errors in /proc/*/*map* sanely
  pagemap: close races with suid execve
  make sessionid permissions in /proc/*/task/* match those in /proc/*
  fix leaks in path_lookupat()

Fix up trivial conflicts in fs/proc/base.c
2011-03-23 20:51:42 -07:00
Oleg Nesterov
52e9fc76d0 procfs: kill the global proc_mnt variable
After the previous cleanup in proc_get_sb() the global proc_mnt has no
reasons to exists, kill it.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:58 -07:00
Eric W. Biederman
4308eebbeb pidns: call pid_ns_prepare_proc() from create_pid_namespace()
Reorganize proc_get_sb() so it can be called before the struct pid of the
first process is allocated.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:58 -07:00
Kees Cook
5883f57ca0 proc: protect mm start_code/end_code in /proc/pid/stat
While mm->start_stack was protected from cross-uid viewing (commit
f83ce3e6b0 ("proc: avoid information leaks to non-privileged
processes")), the start_code and end_code values were not.  This would
allow the text location of a PIE binary to leak, defeating ASLR.

Note that the value "1" is used instead of "0" for a protected value since
"ps", "killall", and likely other readers of /proc/pid/stat, take
start_code of "0" to mean a kernel thread and will misbehave.  Thanks to
Brad Spengler for pointing this out.

Addresses CVE-2011-0726

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Cc: <stable@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:37 -07:00
Alexey Dobriyan
312ec7e50c proc: make struct proc_dir_entry::namelen unsigned int
1. namelen is declared "unsigned short" which hints for "maybe space savings".
   Indeed in 2.4 struct proc_dir_entry looked like:

        struct proc_dir_entry {
                unsigned short low_ino;
                unsigned short namelen;

   Now, low_ino is "unsigned int", all savings were gone for a long time.
   "struct proc_dir_entry" is not that countless to worry about it's size,
   anyway.

2. converting from unsigned short to int/unsigned int can only create
   problems, we better play it safe.

Space is not really conserved, because of natural alignment for the next
field.  sizeof(struct proc_dir_entry) remains the same.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:37 -07:00
Jovi Zhang
fc3d8767b2 procfs: fix some wrong error code usage
[root@wei 1]# cat /proc/1/mem
cat: /proc/1/mem: No such process

error code -ESRCH is wrong in this situation.  Return -EPERM instead.

Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:36 -07:00
Aaro Koskinen
0db0c01b53 procfs: fix /proc/<pid>/maps heap check
The current code fails to print the "[heap]" marking if the heap is split
into multiple mappings.

Fix the check so that the marking is displayed in all possible cases:
	1. vma matches exactly the heap
	2. the heap vma is merged e.g. with bss
	3. the heap vma is splitted e.g. due to locked pages

Test cases. In all cases, the process should have mapping(s) with
[heap] marking:

	(1) vma matches exactly the heap

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/types.h>

	int main (void)
	{
		if (sbrk(4096) != (void *)-1) {
			printf("check /proc/%d/maps\n", (int)getpid());
			while (1)
				sleep(1);
		}
		return 0;
	}

	# ./test1
	check /proc/553/maps
	[1] + Stopped                    ./test1
	# cat /proc/553/maps | head -4
	00008000-00009000 r-xp 00000000 01:00 3113640    /test1
	00010000-00011000 rw-p 00000000 01:00 3113640    /test1
	00011000-00012000 rw-p 00000000 00:00 0          [heap]
	4006f000-40070000 rw-p 00000000 00:00 0

	(2) the heap vma is merged

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/types.h>

	char foo[4096] = "foo";
	char bar[4096];

	int main (void)
	{
		if (sbrk(4096) != (void *)-1) {
			printf("check /proc/%d/maps\n", (int)getpid());
			while (1)
				sleep(1);
		}
		return 0;
	}

	# ./test2
	check /proc/556/maps
	[2] + Stopped                    ./test2
	# cat /proc/556/maps | head -4
	00008000-00009000 r-xp 00000000 01:00 3116312    /test2
	00010000-00012000 rw-p 00000000 01:00 3116312    /test2
	00012000-00014000 rw-p 00000000 00:00 0          [heap]
	4004a000-4004b000 rw-p 00000000 00:00 0

	(3) the heap vma is splitted (this fails without the patch)

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/types.h>

	int main (void)
	{
		if ((sbrk(4096) != (void *)-1) && !mlockall(MCL_FUTURE) &&
		    (sbrk(4096) != (void *)-1)) {
			printf("check /proc/%d/maps\n", (int)getpid());
			while (1)
				sleep(1);
		}
		return 0;
	}

	# ./test3
	check /proc/559/maps
	[1] + Stopped                    ./test3
	# cat /proc/559/maps|head -4
	00008000-00009000 r-xp 00000000 01:00 3119108    /test3
	00010000-00011000 rw-p 00000000 01:00 3119108    /test3
	00011000-00012000 rw-p 00000000 00:00 0          [heap]
	00012000-00013000 rw-p 00000000 00:00 0          [heap]

It looks like the bug has been there forever, and since it only results in
some information missing from a procfile, it does not fulfil the -stable
"critical issue" criteria.

Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:36 -07:00
Konstantin Khlebnikov
51e031496d proc: hide kernel addresses via %pK in /proc/<pid>/stack
This file is readable for the task owner.  Hide kernel addresses from
unprivileged users, leave them function names and offsets.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Kees Cook <kees.cook@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:36 -07:00
Al Viro
a9712bc12c deal with races in /proc/*/{syscall,stack,personality}
All of those are rw-r--r-- and all are broken for suid - if you open
a file before the target does suid-root exec, you'll be still able
to access it.  For personality it's not a big deal, but for syscall
and stack it's a real problem.

Fix: check that task is tracable for you at the time of read().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 17:01:18 -04:00
Stephen Wilson
198214a7ee proc: enable writing to /proc/pid/mem
With recent changes there is no longer a security hazard with writing to
/proc/pid/mem.  Remove the #ifdef.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:59 -04:00
Stephen Wilson
8b0db9db19 proc: make check_mem_permission() return an mm_struct on success
This change allows us to take advantage of access_remote_vm(), which in turn
eliminates a security issue with the mem_write() implementation.

The previous implementation of mem_write() was insecure since the target task
could exec a setuid-root binary between the permission check and the actual
write.  Holding a reference to the target mm_struct eliminates this
vulnerability.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:59 -04:00
Stephen Wilson
18f661bcf8 proc: hold cred_guard_mutex in check_mem_permission()
Avoid a potential race when task exec's and we get a new ->mm but check against
the old credentials in ptrace_may_access().

Holding of the mutex is implemented by factoring out the body of the code into a
helper function __check_mem_permission().  Performing this factorization now
simplifies upcoming changes and minimizes churn in the diff's.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:58 -04:00
Stephen Wilson
26947f8c8f proc: disable mem_write after exec
This change makes mem_write() observe the same constraints as mem_read().  This
is particularly important for mem_write as an accidental leak of the fd across
an exec could result in arbitrary modification of the target process' memory.
IOW, /proc/pid/mem is implicitly close-on-exec.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:58 -04:00
Stephen Wilson
31db58b3ab mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
Morally, the presence of a gate vma is more an attribute of a particular mm than
a particular task.  Moreover, dropping the dependency on task_struct will help
make both existing and future operations on mm's more flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:54 -04:00
Al Viro
2fadaef412 auxv: require the target to be tracable (or yourself)
same as for environ, except that we didn't do any checks to
prevent access after suid execve

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:52 -04:00
Al Viro
d6f64b89d7 close race in /proc/*/environ
Switch to mm_for_maps().  Maybe we ought to make it r--r--r--,
since we do checks on IO anyway...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:51 -04:00
Al Viro
ec6fd8a435 report errors in /proc/*/*map* sanely
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:50 -04:00
Al Viro
ca6b0bf0e0 pagemap: close races with suid execve
just use mm_for_maps()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:50 -04:00
Al Viro
26ec3c646e make sessionid permissions in /proc/*/task/* match those in /proc/*
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:49 -04:00
Dave Hansen
4031a219d8 smaps: have smaps show transparent huge pages
Now that the mere act of _looking_ at /proc/$pid/smaps will not destroy
transparent huge pages, tell how much of the VMA is actually mapped with
them.

This way, we can make sure that we're getting THPs where we
expect to see them.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Cc: Michael J Wolf <mjwolf@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Dave Hansen
22e057c592 smaps: teach smaps_pte_range() about THP pmds
This adds code to explicitly detect and handle pmd_trans_huge() pmds.  It
then passes HPAGE_SIZE units in to the smap_pte_entry() function instead
of PAGE_SIZE.

This means that using /proc/$pid/smaps now will no longer cause THPs to be
broken down in to small pages.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michael J Wolf <mjwolf@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Dave Hansen
3c9acc7849 smaps: pass pte size argument in to smaps_pte_entry()
Add an argument to the new smaps_pte_entry() function to let it account in
things other than PAGE_SIZE units.  I changed all of the PAGE_SIZE sites,
even though not all of them can be reached for transparent huge pages,
just so this will continue to work without changes as THPs are improved.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Cc: Michael J Wolf <mjwolf@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Dave Hansen
ae11c4d9f6 smaps: break out smaps_pte_entry() from smaps_pte_range()
We will use smaps_pte_entry() in a moment to handle both small and
transparent large pages.  But, we must break it out of smaps_pte_range()
first.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Cc: Michael J Wolf <mjwolf@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Dave Hansen
033193275b pagewalk: only split huge pages when necessary
Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
unconditionally split any transparent huge pages it runs in to.  In
practice, that means that anyone doing a

	cat /proc/$pid/smaps

will unconditionally break down every huge page in the process and depend
on khugepaged to re-collapse it later.  This is fairly suboptimal.

This patch changes that behavior.  It teaches each ->pmd_entry handler
(there are five) that they must break down the THPs themselves.  Also, the
_generic_ code will never break down a THP unless a ->pte_entry handler is
actually set.

This means that the ->pmd_entry handlers can now choose to deal with THPs
without breaking them down.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Cc: Michael J Wolf <mjwolf@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
James Morris
a002951c97 Merge branch 'next' into for-linus 2011-03-16 09:41:17 +11:00
Al Viro
ae50adcb0a /proc/self is never going to be invalidated...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10 03:41:53 -05:00
Al Viro
dfef6dcd35 unfuck proc_sysctl ->d_compare()
a) struct inode is not going to be freed under ->d_compare();
however, the thing PROC_I(inode)->sysctl points to just might.
Fortunately, it's enough to make freeing that sucker delayed,
provided that we don't step on its ->unregistering, clear
the pointer to it in PROC_I(inode) before dropping the reference
and check if it's NULL in ->d_compare().

b) I'm not sure that we *can* walk into NULL inode here (we recheck
dentry->seq between verifying that it's still hashed / fetching
dentry->d_inode and passing it to ->d_compare() and there's no
negative hashed dentries in /proc/sys/*), but if we can walk into
that, we really should not have ->d_compare() return 0 on it!
Said that, I really suspect that this check can be simply killed.
Nick?

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-08 02:22:27 -05:00
James Morris
fe3fa43039 Merge branch 'master' of git://git.infradead.org/users/eparis/selinux into next 2011-03-08 11:38:10 +11:00
Paul Bolle
8aaccf7fa2 of/flattree: Drop an uninteresting message to pr_debug level
This message looks like an error (which it isn't) when booting with a
flattened device tree.  Remove the message from normal kernel builds.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-03-02 13:45:18 -07:00
Martin Schwidefsky
261cd298a8 s390: remove task_show_regs
task_show_regs used to be a debugging aid in the early bringup days
of Linux on s390. /proc/<pid>/status is a world readable file, it
is not a good idea to show the registers of a process. The only
correct fix is to remove task_show_regs.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-15 07:34:16 -08:00
Lucian Adrian Grijincu
8e6c96935f security/selinux: fix /proc/sys/ labeling
This fixes an old (2007) selinux regression: filesystem labeling for
/proc/sys returned
     -r--r--r-- unknown                          /proc/sys/fs/file-nr
instead of
     -r--r--r-- system_u:object_r:sysctl_fs_t:s0 /proc/sys/fs/file-nr

Events that lead to breaking of /proc/sys/ selinux labeling:

1) sysctl was reimplemented to route all calls through /proc/sys/

    commit 77b14db502
    [PATCH] sysctl: reimplement the sysctl proc support

2) proc_dir_entry was removed from ctl_table:

    commit 3fbfa98112
    [PATCH] sysctl: remove the proc_dir_entry member for the sysctl tables

3) selinux still walked the proc_dir_entry tree to apply
   labeling. Because ctl_tables don't have a proc_dir_entry, we did
   not label /proc/sys/ inodes any more. To achieve this the /proc/sys/
   inodes were marked private and private inodes were ignored by
   selinux.

    commit bbaca6c2e7
    [PATCH] selinux: enhance selinux to always ignore private inodes

    commit 86a71dbd3e
    [PATCH] sysctl: hide the sysctl proc inodes from selinux

Access control checks have been done by means of a special sysctl hook
that was called for read/write accesses to any /proc/sys/ entry.

We don't have to do this because, instead of walking the
proc_dir_entry tree we can walk the dentry tree (as done in this
patch). With this patch:
* we don't mark /proc/sys/ inodes as private
* we don't need the sysclt security hook
* we walk the dentry tree to find the path to the inode.

We have to strip the PID in /proc/PID/ entries that have a
proc_dir_entry because selinux does not know how to label paths like
'/1/net/rpc/nfsd.fh' (and defaults to 'proc_t' labeling). Selinux does
know of '/net/rpc/nfsd.fh' (and applies the 'sysctl_rpc_t' label).

PID stripping from the path was done implicitly in the previous code
because the proc_dir_entry tree had the root in '/net' in the example
from above. The dentry tree has the root in '/1'.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2011-02-01 11:53:54 -05:00
Torben Hohn
ac751efa6a console: rename acquire/release_console_sem() to console_lock/unlock()
The -rt patches change the console_semaphore to console_mutex.  As a
result, a quite large chunk of the patches changes all
acquire/release_console_sem() to acquire/release_console_mutex()

This commit makes things use more neutral function names which dont make
implications about the underlying lock.

The only real change is the return value of console_trylock which is
inverted from try_acquire_console_sem()

This patch also paves the way to switching console_sem from a semaphore to
a mutex.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make console_trylock return 1 on success, per Geert]
Signed-off-by: Torben Hohn <torbenh@gmx.de>
Cc: Thomas Gleixner <tglx@tglx.de>
Cc: Greg KH <gregkh@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:06 +10:00
David Rientjes
6a108a14fa kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT
The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
is used to configure any non-standard kernel with a much larger scope than
only small devices.

This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
references to the option throughout the kernel.  A new CONFIG_EMBEDDED
option is added that automatically selects CONFIG_EXPERT when enabled and
can be used in the future to isolate options that should only be
considered for embedded systems (RISC architectures, SLOB, etc).

Calling the option "EXPERT" more accurately represents its intention: only
expert users who understand the impact of the configuration changes they
are making should enable it.

Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Greg KH <gregkh@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:05 -08:00
Andrea Arcangeli
5f24ce5fd3 thp: remove PG_buddy
PG_buddy can be converted to _mapcount == -2.  So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y.  This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes.  We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
79134171df thp: transparent hugepage vmstat
Add hugepage stat information to /proc/vmstat and /proc/meminfo.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Mandeep Singh Baines
dabb16f639 oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down
We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground.  Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE.  This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork.  Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex.  The alternatives also
have much higher overhead.

This patch updated from original patch based on feedback from David
Rientjes.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:35 -08:00
Nikanth Karthikesan
2d90508f63 mm: smaps: export mlock information
Currently there is no way to find whether a process has locked its pages
in memory or not.  And which of the memory regions are locked in memory.

Add a new field "Locked" to export this information via the smaps file.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:33 -08:00
Dave Anderson
ceff1a7709 /proc/kcore: fix seeking
Commit 34aacb2920 ("procfs: Use generic_file_llseek in /proc/kcore") broke
seeking on /proc/kcore.  This changes it back to use default_llseek in
order to restore the original behavior.

The problem with generic_file_llseek is that it only allows seeks up to
inode->i_sb->s_maxbytes, which is 2GB-1 on procfs, where the memory file
offset values in the /proc/kcore PT_LOAD segments may exceed or start
beyond that offset value.

A similar revert was made for /proc/vmcore.

Signed-off-by: Dave Anderson <anderson@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Alexey Dobriyan
bf33cbdf8a proc: move proc_console.c to fs/proc/consoles.c
Filename is supposed to match procfile name for random junk.

Add __init while I'm at it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Alexey Dobriyan
3740a20c4f proc: less LOCK/UNLOCK in remove_proc_entry()
For the common case where a proc entry is being removed and nobody is in
the process of using it, save a LOCK/UNLOCK pair.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Petr Holasek
a6fc86d2b4 kpagecount: add slab page checking because _mapcount is in a union
Add a PageSlab() check before adding the _mapcount value to /kpagecount.
page->_mapcount is in a union with the SLAB structure so for pages
controlled by SLAB, page_mapcount() returns nonsense.

Signed-off-by: Petr Holasek <pholasek@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Jovi Zhang
c6a3405846 proc: use single_open() correctly
single_open()'s third argument is for copying into seq_file->private.  Use
that, rather than open-coding it.

Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Alexey Dobriyan
6d1b6e4eff proc: ->low_ino cleanup
- ->low_ino is write-once field -- reading it under locks is unnecessary.

- /proc/$PID stuff never reaches pde_put()/free_proc_entry() --
   PROC_DYNAMIC_FIRST check never triggers.

- in proc_get_inode(), inode number always matches proc dir entry, so
  save one parameter.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Alexey Dobriyan
9d6de12f70 proc: use seq_puts()/seq_putc() where possible
For string without format specifiers, use seq_puts().
For seq_printf("\n"), use seq_putc('\n').

   text	   data	    bss	    dec	    hex	filename
  61866	    488	    112	  62466	   f402	fs/proc/proc.o
  61729	    488	    112	  62329	   f379	fs/proc/proc.o
  ----------------------------------------------------
  			   -139

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Alexey Dobriyan
a2ade7b6ca proc: use unsigned long inside /proc/*/statm
/proc/*/statm code needlessly truncates data from unsigned long to int.
One needs only 8+ TB of RAM to make truncation visible.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Joe Perches
34e49d4f63 fs/proc/base.c, kernel/latencytop.c: convert sprintf_symbol() to %ps
Use temporary lr for struct latency_record for improved readability and
fewer columns used.  Removed trailing space from output.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Linus Torvalds
56b85f32d5 Merge branch 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (36 commits)
  serial: apbuart: Fixup apbuart_console_init()
  TTY: Add tty ioctl to figure device node of the system console.
  tty: add 'active' sysfs attribute to tty0 and console device
  drivers: serial: apbuart: Handle OF failures gracefully
  Serial: Avoid unbalanced IRQ wake disable during resume
  tty: fix typos/errors in tty_driver.h comments
  pch_uart : fix warnings for 64bit compile
  8250: fix uninitialized FIFOs
  ip2: fix compiler warning on ip2main_pci_tbl
  specialix: fix compiler warning on specialix_pci_tbl
  rocket: fix compiler warning on rocket_pci_ids
  8250: add a UPIO_DWAPB32 for 32 bit accesses
  8250: use container_of() instead of casting
  serial: omap-serial: Add support for kernel debugger
  serial: fix pch_uart kconfig & build
  drivers: char: hvc: add arm JTAG DCC console support
  RS485 documentation: add 16C950 UART description
  serial: ifx6x60: fix memory leak
  serial: ifx6x60: free IRQ on error
  Serial: EG20T: add PCH_UART driver
  ...

Fixed up conflicts in drivers/serial/apbuart.c with evil merge that
makes the code look fairly sane (unlike either side).
2011-01-07 14:39:20 -08:00
Linus Torvalds
b4a45f5fe8 Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: (57 commits)
  fs: scale mntget/mntput
  fs: rename vfsmount counter helpers
  fs: implement faster dentry memcmp
  fs: prefetch inode data in dcache lookup
  fs: improve scalability of pseudo filesystems
  fs: dcache per-inode inode alias locking
  fs: dcache per-bucket dcache hash locking
  bit_spinlock: add required includes
  kernel: add bl_list
  xfs: provide simple rcu-walk ACL implementation
  btrfs: provide simple rcu-walk ACL implementation
  ext2,3,4: provide simple rcu-walk ACL implementation
  fs: provide simple rcu-walk generic_check_acl implementation
  fs: provide rcu-walk aware permission i_ops
  fs: rcu-walk aware d_revalidate method
  fs: cache optimise dentry and inode for rcu-walk
  fs: dcache reduce branches in lookup path
  fs: dcache remove d_mounted
  fs: fs_struct use seqlock
  fs: rcu-walk for path lookup
  ...
2011-01-07 08:56:33 -08:00
Nick Piggin
b74c79e993 fs: provide rcu-walk aware permission i_ops
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin
34286d6662 fs: rcu-walk aware d_revalidate method
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin
fb045adb99 fs: dcache reduce branches in lookup path
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:28 +11:00
Nick Piggin
31e6b01f41 fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
Nick Piggin
fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin
621e155a35 fs: change d_compare for rcu-walk
Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback,
however there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:19 +11:00
Nick Piggin
fe15ce446b fs: change d_delete semantics
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:18 +11:00
Linus Torvalds
3c0cb7c31c Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (416 commits)
  ARM: DMA: add support for DMA debugging
  ARM: PL011: add DMA burst threshold support for ST variants
  ARM: PL011: Add support for transmit DMA
  ARM: PL011: Ensure IRQs are disabled in UART interrupt handler
  ARM: PL011: Separate hardware FIFO size from TTY FIFO size
  ARM: PL011: Allow better handling of vendor data
  ARM: PL011: Ensure error flags are clear at startup
  ARM: PL011: include revision number in boot-time port printk
  ARM: vexpress: add sched_clock() for Versatile Express
  ARM i.MX53: Make MX53 EVK bootable
  ARM i.MX53: Some bug fix about MX53 MSL code
  ARM: 6607/1: sa1100: Update platform device registration
  ARM: 6606/1: sa1100: Fix platform device registration
  ARM i.MX51: rename IPU irqs
  ARM i.MX51: Add ipu clock support
  ARM: imx/mx27_3ds: Add PMIC support
  ARM: DMA: Replace page_to_dma()/dma_to_page() with pfn_to_dma()/dma_to_pfn()
  mx51: fix usb clock support
  MX51: Add support for usb host 2
  arch/arm/plat-mxc/ehci.c: fix errors/typos
  ...
2011-01-06 16:50:35 -08:00
Russell King
31edf274f9 Merge branches 'ftrace', 'gic', 'io', 'kexec', 'mod', 'sa11x0', 'sh' and 'versatile' into devel 2011-01-05 18:08:10 +00:00
Ingo Molnar
8e9255e6a2 Merge branch 'linus' into sched/core
Merge reason: we want to queue up dependent cleanup

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08 20:15:29 +01:00
Eric W. Biederman
7b2a69ba70 Revert "vfs: show unreachable paths in getcwd and proc"
Because it caused a chroot ttyname regression in 2.6.36.

As of 2.6.36 ttyname does not work in a chroot.  It has already been
reported that screen breaks, and for me this breaks an automated
distribution testsuite, that I need to preserve the ability to run the
existing binaries on for several more years.  glibc 2.11.3 which has a
fix for this is not an option.

The root cause of this breakage is:

    commit 8df9d1a414
    Author: Miklos Szeredi <mszeredi@suse.cz>
    Date:   Tue Aug 10 11:41:41 2010 +0200

    vfs: show unreachable paths in getcwd and proc

    Prepend "(unreachable)" to path strings if the path is not reachable
    from the current root.

    Two places updated are
     - the return string from getcwd()
     - and symlinks under /proc/$PID.

    Other uses of d_path() are left unchanged (we know that some old
    software crashes if /proc/mounts is changed).

    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

So remove the nice sounding, but ultimately ill advised change to how
/proc/fd symlinks work.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-05 16:39:45 -08:00
Mike Galbraith
5091faa449 sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity.  This patch
implements an idea from Linus, to automatically create task
groups.  Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group.  When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the preveious task group is dropped.  Child
processes inherit this task group thereafter, and increase it's
refcount.  When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.

At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

Autogroup bandwidth is controllable via setting it's nice level
through the proc filesystem:

  cat /proc/<pid>/autogroup

Displays the task's group and the group's nice level.

  echo <nice level> > /proc/<pid>/autogroup

Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.

The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:

  echo [01] > /proc/sys/kernel/sched_autogroup_enabled

... which will automatically move tasks to/from the root task group.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 16:03:35 +01:00
Mika Westerberg
9833c39400 ARM: 6485/5: proc/vmcore - allow archs to override vmcore_elf_check_arch()
Allow architectures to redefine this macro if needed. This is useful for
example in architectures where 64-bit ELF vmcores are not supported.
Specifying zero vmcore_elf64_check_arch() allows compiler to optimize
away unnecessary parts of parse_crash_elf64_headers().

We also rename the macro to vmcore_elf64_check_arch() to reflect that it
is used for 64-bit vmcores only.

Signed-off-by: Mika Westerberg <mika.westerberg@iki.fi>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-11-30 13:39:55 +00:00
Naoya Horiguchi
ea251c1d5c pagemap: set pagemap walk limit to PMD boundary
Currently one pagemap_read() call walks in PAGEMAP_WALK_SIZE bytes (== 512
pages.) But there is a corner case where walk_pmd_range() accidentally
runs over a VMA associated with a hugetlbfs file.

For example, when a process has mappings to VMAs as shown below:

  # cat /proc/<pid>/maps
  ...
  3a58f6d000-3a58f72000 rw-p 00000000 00:00 0
  7fbd51853000-7fbd51855000 rw-p 00000000 00:00 0
  7fbd5186c000-7fbd5186e000 rw-p 00000000 00:00 0
  7fbd51a00000-7fbd51c00000 rw-s 00000000 00:12 8614   /hugepages/test

then pagemap_read() goes into walk_pmd_range() path and walks in the range
0x7fbd51853000-0x7fbd51a53000, but the hugetlbfs VMA should be handled by
walk_hugetlb_range().  Otherwise PMD for the hugepage is considered bad
and cleared, which causes undesirable results.

This patch fixes it by separating pagemap walk range into one PMD.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:46 +09:00
Arnd Bergmann
451a3c24b0 BKL: remove extraneous #include <smp_lock.h>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-17 08:59:32 -08:00
Jiri Slaby
23308ba54d console: add /proc/consoles
It allows users to see what consoles are currently known to the system
and with what flags.

It is based on Werner's patch, the part about traversing fds was
removed, the code was moved to kernel/printk.c, where consoles are
handled and it makes more sense to me.

Signed-off-by: Jiri Slaby <jslaby@suse.cz> [cleanups]
Signed-off-by: "Dr. Werner Fink" <werner@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-11-16 12:50:17 -08:00
Al Viro
aed1d84f98 switch procfs to ->mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:01 -04:00
Al Viro
579441a39b setting ->proc_mnt doesn't belong in proc_get_sb()
take that to kern_mount_data()-using callers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:58 -04:00
KAMEZAWA Hiroyuki
478735e388 /proc/stat: fix scalability of irq sum of all cpu
In /proc/stat, the number of per-IRQ event is shown by making a sum each
irq's events on all cpus.  But we can make use of kstat_irqs().

kstat_irqs() do the same calculation, If !CONFIG_GENERIC_HARDIRQ,
it's not a big cost. (Both of the number of cpus and irqs are small.)

If a system is very big and CONFIG_GENERIC_HARDIRQ, it does

	for_each_irq()
		for_each_cpu()
			- look up a radix tree
			- read desc->irq_stat[cpu]
This seems not efficient. This patch adds kstat_irqs() for
CONFIG_GENRIC_HARDIRQ and change the calculation as

	for_each_irq()
		look up radix tree
		for_each_cpu()
			- read desc->irq_stat[cpu]

This reduces cost.

A test on (4096cpusp, 256 nodes, 4592 irqs) host (by Jack Steiner)

%time cat /proc/stat > /dev/null

Before Patch:	 2.459 sec
After Patch :	  .561 sec

[akpm@linux-foundation.org: unexport kstat_irqs, coding-style tweaks]
[akpm@linux-foundation.org: fix unused variable 'per_irq_sum']
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
KAMEZAWA Hiroyuki
f2c66cd8ee /proc/stat: scalability of irq num per cpu
/proc/stat shows the total number of all interrupts to each cpu.  But when
the number of IRQs are very large, it take very long time and 'cat
/proc/stat' takes more than 10 secs.  This is because sum of all irq
events are counted when /proc/stat is read.  This patch adds "sum of all
irq" counter percpu and reduce read costs.

The cost of reading /proc/stat is important because it's used by major
applications as 'top', 'ps', 'w', etc....

A test on a mechin (4096cpu, 256 nodes, 4592 irqs) shows

 %time cat /proc/stat > /dev/null
 Before Patch:  12.627 sec
 After  Patch:  2.459 sec

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
Davidlohr Bueso
19cd56c48d procfs: fix /proc/softirqs formatting
The length of the BLOCK_IPOLL string is making i's value be printed too
far to the right.  This patch fixes this and makes the output a bit
neater.

Currently:
                CPU0
      HI:          0
   TIMER:     599792
  NET_TX:          2
  NET_RX:          6
   BLOCK:      80807
BLOCK_IOPOLL:          0
 TASKLET:      20012
   SCHED:          0
 HRTIMER:         63
     RCU:     619279

With patch:
                    CPU0
          HI:          0
       TIMER:     585582
      NET_TX:          2
      NET_RX:          6
       BLOCK:      80320
BLOCK_IOPOLL:          0
     TASKLET:      19287
       SCHED:          0
     HRTIMER:         62
         RCU:     604441

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
Nikanth Karthikesan
b40d4f84be /proc/pid/smaps: export amount of anonymous memory in a mapping
Export the number of anonymous pages in a mapping via smaps.

Even the private pages in a mapping backed by a file, would be marked as
anonymous, when they are modified. Export this information to user-space via
smaps.

Exporting this count will help gdb to make a better decision on which
areas need to be dumped in its coredump; and should be useful to others
studying the memory usage of a process.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
KOSAKI Motohiro
9b1bf12d5d signals: move cred_guard_mutex from task_struct to signal_struct
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it.  Yes, concurrent
execve() has no worth.

Let's move ->cred_guard_mutex from task_struct to signal_struct.  It
naturally prevent multiple-threads-inside-exec.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:12 -07:00
Linus Torvalds
426e1f5cec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  split invalidate_inodes()
  fs: skip I_FREEING inodes in writeback_sb_inodes
  fs: fold invalidate_list into invalidate_inodes
  fs: do not drop inode_lock in dispose_list
  fs: inode split IO and LRU lists
  fs: switch bdev inode bdi's correctly
  fs: fix buffer invalidation in invalidate_list
  fsnotify: use dget_parent
  smbfs: use dget_parent
  exportfs: use dget_parent
  fs: use RCU read side protection in d_validate
  fs: clean up dentry lru modification
  fs: split __shrink_dcache_sb
  fs: improve DCACHE_REFERENCED usage
  fs: use percpu counter for nr_dentry and nr_dentry_unused
  fs: simplify __d_free
  fs: take dcache_lock inside __d_path
  fs: do not assign default i_ino in new_inode
  fs: introduce a per-cpu last_ino allocator
  new helper: ihold()
  ...
2010-10-26 17:58:44 -07:00
David Rientjes
d19d5476f4 oom: fix locking for oom_adj and oom_score_adj
The locking order in oom_adjust_write() and oom_score_adj_write() for
task->alloc_lock and task->sighand->siglock is reversed, and lockdep
notices that irqs could encounter an ABBA scenario.

This fixes the locking order so that we always take task_lock(task) prior
to lock_task_sighand(task).

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
David Rientjes
723548bff1 oom: rewrite error handling for oom_adj and oom_score_adj tunables
It's better to use proper error handling in oom_adjust_write() and
oom_score_adj_write() instead of duplicating the locking order on various
exit paths.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
Ying Han
3d5992d2ac oom: add per-mm oom disable count
It's pointless to kill a task if another thread sharing its mm cannot be
killed to allow future memory freeing.  A subsequent patch will prevent
kills in such cases, but first it's necessary to have a way to flag a task
that shares memory with an OOM_DISABLE task that doesn't incur an
additional tasklist scan, which would make select_bad_process() an O(n^2)
function.

This patch adds an atomic counter to struct mm_struct that follows how
many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
They cannot be killed by the kernel, so their memory cannot be freed in
oom conditions.

This only requires task_lock() on the task that we're operating on, it
does not require mm->mmap_sem since task_lock() pins the mm and the
operation is atomic.

[rientjes@google.com: changelog and sys_unshare() code]
[rientjes@google.com: protect oom_disable_count with task_lock in fork]
[rientjes@google.com: use old_mm for oom_disable_count in exec]
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
WANG Cong
a4f7326da2 vmcore: it is not experimental any more
We use vmcore in our production kernel for a long time, it is pretty
stable now.  So I don't think we need to mark it as experimental any more.

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
Christoph Hellwig
85fe4025c6 fs: do not assign default i_ino in new_inode
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves.  For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
KAMEZAWA Hiroyuki
4a3956c790 vfs: introduce FMODE_UNSIGNED_OFFSET for allowing negative f_pos
Now, rw_verify_area() checsk f_pos is negative or not.  And if negative,
returns -EINVAL.

But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..  has
negative offsets.  And we can't do any access via read/write to the
file(device).

So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:21 -04:00
Linus Torvalds
6c2754c28f Revert "tty: Add a new file /proc/tty/consoles"
This reverts commit f4a3e0bceb.  Jiri
Sladby points out that the tty structure we're using may already be
gone, and Al Viro doesn't hold back in complaining about the random
loading of 'filp->private_data' which doesn't have to be a pointer at
all, nor does checking the magic field for TTY_MAGIC prove anything.

Belated review by Al:

 "a) global variable depending on stdin of the last opener? Affecting
     output of read(2)? Really?

  b) iterator is broken; list should be locked in ->start(), unlocked in
     ->stop() and *NOT* unlocked/relocked in ->next()

  c) ->show() ought to do nothing in case of ->device == NULL, instead
     of skipping those in ->next()/->start()

  d) regardless of the merits of the bright idea about asterisk at that
     line in output *and* regardless of (a), the implementation is not
     only atrociously ugly, it's actually very likely to be a roothole.
     Verifying that Cthulhu knows what number happens to be address of a
     tty_struct by blindly dereferencing memory at that address...
     Ouch.

  Please revert that crap."

And Christoph pipes in and NAK's the approach of walking fd tables etc
too.  So it's pretty unanimous.

Noticed-by: Jri Slaby <jslaby@suse.cz>
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Werner Fink <werner@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-23 08:14:12 -07:00
Linus Torvalds
73ecf3a6e3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (49 commits)
  serial8250: ratelimit "too much work" error
  serial: bfin_sport_uart: speed up sport RX sample rate to be 3% faster
  serial: abstraction for 8250 legacy ports
  serial/imx: check that the buffer is non-empty before sending it out
  serial: mfd: add more baud rates support
  jsm: Remove the uart port on errors
  Alchemy: Add UART PM methods.
  8250: allow platforms to override PM hook.
  altera_uart: Don't use plain integer as NULL pointer
  altera_uart: Fix missing prototype for registering an early console
  altera_uart: Fixup type usage of port flags
  altera_uart: Make it possible to use Altera UART and 8250 ports together
  altera_uart: Add support for different address strides
  altera_uart: Add support for getting mapbase and IRQ from resources
  altera_uart: Add support for polling mode (IRQ-less)
  serial: Factor out uart_poll_timeout() from 8250 driver
  serial: mark the 8250 driver as maintained
  serial: 8250: Don't delay after transmitter is ready.
  tty: MAINTAINERS: add drivers/serial/jsm/ as maintained driver
  vcs: invoke the vt update callback when /dev/vcs* is written to
  ...
2010-10-22 19:59:04 -07:00
Linus Torvalds
092e0e7e52 Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
  vfs: make no_llseek the default
  vfs: don't use BKL in default_llseek
  llseek: automatically add .llseek fop
  libfs: use generic_file_llseek for simple_attr
  mac80211: disallow seeks in minstrel debug code
  lirc: make chardev nonseekable
  viotape: use noop_llseek
  raw: use explicit llseek file operations
  ibmasmfs: use generic_file_llseek
  spufs: use llseek in all file operations
  arm/omap: use generic_file_llseek in iommu_debug
  lkdtm: use generic_file_llseek in debugfs
  net/wireless: use generic_file_llseek in debugfs
  drm: use noop_llseek
2010-10-22 10:52:56 -07:00
Dr. Werner Fink
f4a3e0bceb tty: Add a new file /proc/tty/consoles
Add a new file /proc/tty/consoles to be able to determine the registered
system console lines.  If the reading process holds /dev/console open at
the regular standard input stream the active device will be marked by an
asterisk.  Show possible operations and also decode the used flags of
the listed console lines.

Signed-off-by: Werner Fink <werner@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-10-22 10:20:05 -07:00
Arnd Bergmann
6038f373a3 llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.

The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.

New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time.  Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.

The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.

Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.

Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.

===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
//   but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}

@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}

@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
   *off = E
|
   *off += E
|
   func(..., off, ...)
|
   E = *off
)
...+>
}

@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}

@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
  *off = E
|
  *off += E
|
  func(..., off, ...)
|
  E = *off
)
...+>
}

@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}

@ fops0 @
identifier fops;
@@
struct file_operations fops = {
 ...
};

@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
 .llseek = llseek_f,
...
};

@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
 .read = read_f,
...
};

@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
 .write = write_f,
...
};

@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
 .open = open_f,
...
};

// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
...  .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};

@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
...  .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};

// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
...  .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};

// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};

// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};

@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+	.llseek = default_llseek, /* write accesses f_pos */
};

// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////

@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
 .write = write_f,
 .read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};

@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};

@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};

@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-15 15:53:27 +02:00
Jiri Olsa
3036e7b490 proc: make /proc/pid/limits world readable
Having the limits file world readable will ease the task of system
management on systems where root privileges might be restricted.

Having admin restricted with root priviledges, he/she could not check
other users process' limits.

Also it'd align with most of the /proc stat files.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Eugene Teo <eugene@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-01 10:50:59 -07:00
KOSAKI Motohiro
1c2499ae87 /proc/pid/smaps: fix dirty pages accounting
Currently, /proc/<pid>/smaps has wrong dirty pages accounting.
Shared_Dirty and Private_Dirty output only pte dirty pages and ignore
PG_dirty page flag.  It is difference against documentation, but also
inconsistent against Referenced field.  (Referenced checks both pte and
page flags)

This patch fixes it.

Test program:

 large-array.c
 ---------------------------------------------------
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>

 char array[1*1024*1024*1024L];

 int main(void)
 {
         memset(array, 1, sizeof(array));
         pause();

         return 0;
 }
 ---------------------------------------------------

Test case:
 1. run ./large-array
 2. cat /proc/`pidof large-array`/smaps
 3. swapoff -a
 4. cat /proc/`pidof large-array`/smaps again

Test result:
 <before patch>

00601000-40601000 rw-p 00000000 00:00 0
Size:            1048576 kB
Rss:             1048576 kB
Pss:             1048576 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:    218992 kB   <-- showed pages as clean incorrectly
Private_Dirty:    829584 kB
Referenced:       388364 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB

 <after patch>

00601000-40601000 rw-p 00000000 00:00 0
Size:            1048576 kB
Rss:             1048576 kB
Pss:             1048576 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:   1048576 kB  <-- fixed
Referenced:       388480 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:39 -07:00
Arnd Bergmann
c227e69028 /proc/vmcore: fix seeking
Commit 73296bc611 ("procfs: Use generic_file_llseek in /proc/vmcore")
broke seeking on /proc/vmcore.  This changes it back to use default_llseek
in order to restore the original behaviour.

The problem with generic_file_llseek is that it only allows seeks up to
inode->i_sb->s_maxbytes, which is zero on procfs and some other virtual
file systems.  We should merge generic_file_llseek and default_llseek some
day and clean this up in a proper way, but for 2.6.35/36, reverting vmcore
is the safer solution.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:38 -07:00
Takashi Iwai
ed430fec75 proc: export uncached bit properly in /proc/kpageflags
Fix the left-over old ifdef for PG_uncached in /proc/kpageflags.  Now it's
used by x86, too.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:23 -07:00
Stefan Bader
39aa3cb3e8 mm: Move vma_stack_continue into mm.h
So it can be used by all that need to check for that.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 09:05:06 -07:00
Linus Torvalds
d7824370e2 mm: fix up some user-visible effects of the stack guard page
This commit makes the stack guard page somewhat less visible to user
space. It does this by:

 - not showing the guard page in /proc/<pid>/maps

   It looks like lvm-tools will actually read /proc/self/maps to figure
   out where all its mappings are, and effectively do a specialized
   "mlockall()" in user space.  By not showing the guard page as part of
   the mapping (by just adding PAGE_SIZE to the start for grows-up
   pages), lvm-tools ends up not being aware of it.

 - by also teaching the _real_ mlock() functionality not to try to lock
   the guard page.

   That would just expand the mapping down to create a new guard page,
   so there really is no point in trying to lock it in place.

It would perhaps be nice to show the guard page specially in
/proc/<pid>/maps (or at least mark grow-down segments some way), but
let's not open ourselves up to more breakage by user space from programs
that depends on the exact deails of the 'maps' file.

Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
source code to see what was going on with the whole new warning.

Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be
Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-15 11:35:52 -07:00
Arnd Bergmann
b19dd42faf bkl: Remove locked .ioctl file operation
The last user is gone, so we can safely remove this

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-08-14 00:24:24 +02:00
Linus Torvalds
5af568cbd5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  isofs: Fix lseek() to position beyond 4 GB
  vfs: remove unused MNT_STRICTATIME
  vfs: show unreachable paths in getcwd and proc
  vfs: only add " (deleted)" where necessary
  vfs: add prepend_path() helper
  vfs: __d_path: dont prepend the name of the root dentry
  ia64: perfmon: add d_dname method
  vfs: add helpers to get root and pwd
  cachefiles: use path_get instead of lone dget
  fs/sysv/super.c: add support for non-PDP11 v7 filesystems
  V7: Adjust sanity checks for some volumes
  Add v7 alias
  v9fs: fixup for inode_setattr being removed

Manual merge to take Al's version of the fs/sysv/super.c file: it merged
cleanly, but Al had removed an unnecessary header include, so his side
was better.
2010-08-11 09:23:32 -07:00
Robert P. J. Day
cfbef3cb16 procfs: simplify conditional processing of fs/proc.o.
Since the entire fs/proc directory is conditionally included based on
CONFIG_PROC_FS, it's redundant to check that same variable within that
directory.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:20 -07:00
Miklos Szeredi
8df9d1a414 vfs: show unreachable paths in getcwd and proc
Prepend "(unreachable)" to path strings if the path is not reachable
from the current root.

Two places updated are
 - the return string from getcwd()
 - and symlinks under /proc/$PID.

Other uses of d_path() are left unchanged (we know that some old
software crashes if /proc/mounts is changed).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-11 00:29:47 -04:00
Miklos Szeredi
f7ad3c6be9 vfs: add helpers to get root and pwd
Add three helpers that retrieve a refcounted copy of the root and cwd
from the supplied fs_struct.

 get_fs_root()
 get_fs_pwd()
 get_fs_root_and_pwd()

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-11 00:28:20 -04:00
Linus Torvalds
5f248c9c25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
  no need for list_for_each_entry_safe()/resetting with superblock list
  Fix sget() race with failing mount
  vfs: don't hold s_umount over close_bdev_exclusive() call
  sysv: do not mark superblock dirty on remount
  sysv: do not mark superblock dirty on mount
  btrfs: remove junk sb_dirt change
  BFS: clean up the superblock usage
  AFFS: wait for sb synchronization when needed
  AFFS: clean up dirty flag usage
  cifs: truncate fallout
  mbcache: fix shrinker function return value
  mbcache: Remove unused features
  add f_flags to struct statfs(64)
  pass a struct path to vfs_statfs
  update VFS documentation for method changes.
  All filesystems that need invalidate_inode_buffers() are doing that explicitly
  convert remaining ->clear_inode() to ->evict_inode()
  Make ->drop_inode() just return whether inode needs to be dropped
  fs/inode.c:clear_inode() is gone
  fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
  ...

Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-10 11:26:52 -07:00
David Rientjes
51b1bd2ace oom: deprecate oom_adj tunable
/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed.  The target date for removal is August 2012.

A warning will be printed to the kernel log if a task attempts to use this
interface.  Future warning will be suppressed until the kernel is rebooted
to prevent spamming the kernel log.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:45:02 -07:00
David Rientjes
a63d83f427 oom: badness heuristic rewrite
This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead.  This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits.  This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory.  "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit.  The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability.  Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered.  The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:45:02 -07:00
Andrew Morton
74bcbf4054 oom: move badness() declaration into oom.h
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:45:02 -07:00
KOSAKI Motohiro
26ebc98491 oom: /proc/<pid>/oom_score treat kernel thread honestly
If a kernel thread is using use_mm(), badness() returns a positive value.
This is not a big issue because caller take care of it correctly.  But
there is one exception, /proc/<pid>/oom_score calls badness() directly and
doesn't care that the task is a regular process.

Another example, /proc/1/oom_score return !0 value.  But it's unkillable.
This incorrectness makes administration a little confusing.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:45:01 -07:00
Al Viro
8267952b36 switch procfs to ->evict_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:52 -04:00
Christoph Hellwig
1025774ce4 remove inode_setattr
Replace inode_setattr with opencoded variants of it in all callers.  This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.

In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:

 spufs: explicitly checks for ATTR_SIZE earlier
 btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
 ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:37 -04:00
David Howells
de09a9771a CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
credentials by incrementing their usage count after their replacement by the
task being accessed.

What happens is that get_task_cred() can race with commit_creds():

	TASK_1			TASK_2			RCU_CLEANER
	-->get_task_cred(TASK_2)
	rcu_read_lock()
	__cred = __task_cred(TASK_2)
				-->commit_creds()
				old_cred = TASK_2->real_cred
				TASK_2->real_cred = ...
				put_cred(old_cred)
				  call_rcu(old_cred)
		[__cred->usage == 0]
	get_cred(__cred)
		[__cred->usage == 1]
	rcu_read_unlock()
							-->put_cred_rcu()
							[__cred->usage == 1]
							panic()

However, since a tasks credentials are generally not changed very often, we can
reasonably make use of a loop involving reading the creds pointer and using
atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.

If successful, we can safely return the credentials in the knowledge that, even
if the task we're accessing has released them, they haven't gone to the RCU
cleanup code.

We then change task_state() in procfs to use get_task_cred() rather than
calling get_cred() on the result of __task_cred(), as that suffers from the
same problem.

Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
tripped when it is noticed that the usage count is not zero as it ought to be,
for example:

kernel BUG at kernel/cred.c:168!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
745
RIP: 0010:[<ffffffff81069881>]  [<ffffffff81069881>] __put_cred+0xc/0x45
RSP: 0018:ffff88019e7e9eb8  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
FS:  00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
Stack:
 ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
<0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
<0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
Call Trace:
 [<ffffffff810698cd>] put_cred+0x13/0x15
 [<ffffffff81069b45>] commit_creds+0x16b/0x175
 [<ffffffff8106aace>] set_current_groups+0x47/0x4e
 [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
RIP  [<ffffffff81069881>] __put_cred+0xc/0x45
 RSP <ffff88019e7e9eb8>
---[ end trace df391256a100ebdd ]---

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-29 15:16:17 -07:00
Mike Frysinger
3c26c9d959 nommu: add '[stack]' label to /proc/pid/maps output
Add support to the NOMMU /proc/pid/maps file to show which mapping is the stack
of the original thread after execve.  This is largely based on the MMU code.
Subsidiary thread stacks are not indicated.

For FDPIC, we now get:

	root:/> cat /proc/self/maps
	02064000-02067ccc rw-p 0004d000 00:01 22         /bin/busybox
	0206e000-0206f35c rw-p 00006000 00:01 295        /lib/ld-uClibc.so.0
	025f0000-025f6f0c r-xs 00000000 00:01 295        /lib/ld-uClibc.so.0
	02680000-026ba6b0 r-xs 00000000 00:01 297        /lib/libc.so.0
	02700000-0274d384 r-xs 00000000 00:01 22         /bin/busybox
	02816000-02817000 rw-p 00000000 00:00 0
	02848000-0284c0d8 rw-p 00000000 00:00 0
	02860000-02880000 rw-p 00000000 00:00 0          [stack]

The semi-downside here is that for FLAT, we get:

	root:/> cat /proc/155/maps
	029f0000-029f9000 rwxp 00000000 00:00 0          [stack]

The reason being that FLAT combines a whole lot of stuff into one map
(including the stack).  But this isn't any worse than the current output
(which is nothing), so screw it.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-29 15:29:30 -07:00
Michael Ellerman
9f069af5b6 of: Drop properties with "/" in their name
Some bogus firmwares include properties with "/" in their name. This
causes problems when creating the /proc/device-tree file system,
because the slash is taken to indicate a directory.

We don't care about those properties, and we don't want to encourage
them, so just throw them away when creating /proc/device-tree.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-06-13 18:12:24 -06:00
Wu Fengguang
36e15263aa kcore: add _text to KCORE_TEXT
Extend KCORE_TEXT to cover the pages between _text and _stext, to allow
examining some important page table pages.

`readelf -a` output on x86_64 before and after patch:
	  Type           Offset             VirtAddr           PhysAddr
before    LOAD           0x00007fff8100c000 0xffffffff81009000 0x0000000000000000
after     LOAD           0x00007fff81003000 0xffffffff81000000 0x0000000000000000

The newly covered pages are:

	0xffffffff81000000 <startup_64> etc.
	0xffffffff81001000 <init_level4_pgt>
	0xffffffff81002000 <level3_ident_pgt>
	0xffffffff81003000 <level3_kernel_pgt>
	0xffffffff81004000 <level2_fixmap_pgt>
	0xffffffff81005000 <level1_fixmap_pgt>
	0xffffffff81006000 <level2_ident_pgt>
	0xffffffff81007000 <level2_kernel_pgt>
	0xffffffff81008000 <level2_spare_pgt>

Before patch, /proc/kcore shows outdated contents for the above page
table pages, for example:

	(gdb) p level3_ident_pgt
	$1 = {<text variable, no debug info>} 0xffffffff81002000 <level3_ident_pgt>
	(gdb) p/x *((pud_t *)&level3_ident_pgt)@512
	$2 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>}

while the real content is:

	root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem
	1002000 6063 0100 0000 0000 8067 0000 0000 0000
	1002010 0000 0000 0000 0000 0000 0000 0000 0000
	*
	1003000

That is, on a x86_64 box with 2GB memory, we can see first-1GB / full-2GB
identity mapping before/after patch:

	(gdb) p/x *((pud_t *)&level3_ident_pgt)@512
before  $1 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>}
after   $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} <repeats 510 times>}

Obviously the content before patch is wrong.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 09:12:47 -07:00
Amerigo Wang
57f87869f0 proc: remove obsolete comments
A quick test shows these comments are obsolete, so just remove them.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 09:12:47 -07:00
Dan Carpenter
73d3646029 proc: cleanup: remove unused assignments
I removed 3 unused assignments.  The first two get reset on the first
statement of their functions.  For "err" in root.c we don't return an
error and we don't use the variable again.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 09:12:47 -07:00
Oleg Nesterov
7e49827cc9 proc: get_nr_threads() doesn't need ->siglock any longer
Now that task->signal can't go away get_nr_threads() doesn't need
->siglock to read signal->count.

Also, make it inline, move into sched.h, and convert 2 other proc users of
signal->count to use this (now trivial) helper.

Henceforth get_nr_threads() is the only valid user of signal->count, we
are ready to turn it into "int nr_threads" or, perhaps, kill it.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 09:12:47 -07:00
Naoya Horiguchi
1a5cb81465 pagemap: add #ifdefs CONFIG_HUGETLB_PAGE on code walking hugetlb vma
If !CONFIG_HUGETLB_PAGE, pagemap_hugetlb_range() is never called.  So put
it (and its calling function) into #ifdef block.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 08:06:58 -07:00
Linus Torvalds
98c89cdd3a Merge branch 'bkl/procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'bkl/procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  sunrpc: Include missing smp_lock.h
  procfs: Kill the bkl in ioctl
  procfs: Push down the bkl from ioctl
  procfs: Use generic_file_llseek in /proc/vmcore
  procfs: Use generic_file_llseek in /proc/kmsg
  procfs: Use generic_file_llseek in /proc/kcore
  procfs: Kill BKL in llseek on proc base
2010-05-19 17:23:28 -07:00
Frederic Weisbecker
c2f980500a procfs: Kill the bkl in ioctl
There are no more users of procfs that implement the ioctl
callback. Drop the bkl from this path and warn on any use
of this callback.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-05-17 03:06:24 +02:00
Robin Holt
34441427aa revert "procfs: provide stack information for threads" and its fixup commits
Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.

Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.

Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.

Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.

Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.

This patch nearly undoes the effect of all these patches.

The reason for reverting these is it provides an unusable value in
field 28.  For x86_64, a fork will result in the task->stack_start
value being updated to the current user top of stack and not the stack
start address.  This unpredictability of the stack_start value makes
it worthless.  That includes the intended use of showing how much stack
space a thread has.

Other architectures will get different values.  As an example, ia64
gets 0.  The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.

I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") .  If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured.  Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.

I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") .  I left the KSTK_ESP() change in
place as that seemed worthwhile.

Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11 17:33:41 -07:00
Jerome Marchand
3835541dd4 procfs: fix tid fdinfo
Correct the file_operations struct in fdinfo entry of tid_base_stuff[].

Presently /proc/*/task/*/fdinfo contains symlinks to opened files like
/proc/*/fd/.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-27 16:26:03 -07:00
Frederic Weisbecker
73296bc611 procfs: Use generic_file_llseek in /proc/vmcore
/proc/vmcore has no llseek and then falls down to use default_llseek.
This is racy against read_vmcore() that directly manipulates fpos
but it doesn't hold the bkl there so using it in llseek doesn't
protect anything.

Let's use generic_file_llseek() instead.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-04-09 17:23:24 +02:00
Frederic Weisbecker
41775e29a7 procfs: Use generic_file_llseek in /proc/kmsg
No need to hold the bkl to seek here, none of the other
fops callbacks use it.

Use generic_file_llseek explicitly.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-04-09 16:35:41 +02:00
Frederic Weisbecker
34aacb2920 procfs: Use generic_file_llseek in /proc/kcore
/proc/kcore has no llseek and then falls down to use default_llseek.
This is racy against read_kcore() that directly manipulates fpos
but it doesn't hold the bkl there so using it in llseek doesn't
protect anything.

Let's use generic_file_llseek() instead.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-04-09 16:32:02 +02:00
Arnd Bergmann
87df842410 procfs: Kill BKL in llseek on proc base
We don't use the BKL elsewhere, so use generic_file_llseek
so we can avoid default_llseek taking the BKL.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[restore proc_fdinfo_file_operations as non-seekable]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-04-09 16:29:12 +02:00
Naoya Horiguchi
116354d177 pagemap: fix pfn calculation for hugepage
When we look into pagemap using page-types with option -p, the value of
pfn for hugepages looks wrong (see below.) This is because pte was
evaluated only once for one vma although it should be updated for each
hugepage.  This patch fixes it.

  $ page-types -p 3277 -Nl -b huge
  voffset   offset  len     flags
  7f21e8a00 11e400  1       ___U___________H_G________________
  7f21e8a01 11e401  1ff     ________________TG________________
               ^^^
  7f21e8c00 11e400  1       ___U___________H_G________________
  7f21e8c01 11e401  1ff     ________________TG________________
               ^^^

One hugepage contains 1 head page and 511 tail pages in x86_64 and each
two lines represent each hugepage.  Voffset and offset mean virtual
address and physical address in the page unit, respectively.  The
different hugepages should not have the same offset value.

With this patch applied:

  $ page-types -p 3386 -Nl -b huge
  voffset   offset   len    flags
  7fec7a600 112c00   1      ___UD__________H_G________________
  7fec7a601 112c01   1ff    ________________TG________________
               ^^^
  7fec7a800 113200   1      ___UD__________H_G________________
  7fec7a801 113201   1ff    ________________TG________________
               ^^^
               OK

More info:

- This patch modifies walk_page_range()'s hugepage walker.  But the
  change only affects pagemap_read(), which is the only caller of hugepage
  callback.

- Without this patch, hugetlb_entry() callback is called per vma, that
  doesn't match the natural expectation from its name.

- With this patch, hugetlb_entry() is called per hugepte entry and the
  callback can become much simpler.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-07 08:38:04 -07:00
Dan Carpenter
309361e09c proc: copy_to_user() returns unsigned
copy_to_user() returns the number of bytes left to be copied.

This was a typo from: d82ef020cf "proc: pagemap: Hold mmap_sem during
page walk".

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-06 08:23:47 -07:00
Tejun Heo
336f5899d2 Merge branch 'master' into export-slabh 2010-04-05 11:37:28 +09:00
KAMEZAWA Hiroyuki
d82ef020cf proc: pagemap: Hold mmap_sem during page walk
In initial design, walk_page_range() was designed just for walking page
table and it didn't require mmap_sem.  Now, find_vma() etc..  are used
in walk_page_range() and we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because /proc/<pid>/pagemap's callback routine use put_user(), we have
to get rid of it to do sane fix.

Changelog: 2010/Apr/2
 - fixed start_vaddr and end overflow
Changelog: 2010/Apr/1
 - fixed start_vaddr calculation
 - removed unnecessary cast.
 - removed unnecessary change in smaps.
 - use GFP_TEMPORARY instead of GFP_KERNEL

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: San Mehat <san@google.com>
Cc: Brian Swetland <swetland@google.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
[ Fixed kmalloc failure return code as per Matt ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-04 12:06:02 -07:00
Oleg Nesterov
b95c35e76b oom: fix the unsafe usage of badness() in proc_oom_score()
proc_oom_score(task) has a reference to task_struct, but that is all.
If this task was already released before we take tasklist_lock

	- we can't use task->group_leader, it points to nowhere

	- it is not safe to call badness() even if this task is
	  ->group_leader, has_intersects_mems_allowed() assumes
	  it is safe to iterate over ->thread_group list.

	- even worse, badness() can hit ->signal == NULL

Add the pid_alive() check to ensure __unhash_process() was not called.

Also, use "task" instead of task->group_leader. badness() should return
the same result for any sub-thread. Currently this is not true, but
this should be changed anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-01 08:50:21 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Dan Carpenter
4fd2c20d96 kcore: fix test for end of list
"m" is never NULL here.  We need a different test for the end of list
condition.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:22 -07:00
Alexey Dobriyan
12bac0d9f4 proc: warn on non-existing proc entries
* warn if creation goes on to non-existent directory
* warn if removal goes on from non-existing directory
* warn if non-existing proc entry is removed

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:45 -08:00
Alexey Dobriyan
e17a5765f2 proc: do translation + unlink atomically at remove_proc_entry()
remove_proc_entry() does

	lock
	lookup parent
	unlock
	lock
	unlink proc entry from lists
	unlock

which can be made bit more correct by doing parent translation + unlink
without dropping lock.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:45 -08:00
Jiri Slaby
d554ed895d fs: use rlimit helpers
Make sure compiler won't do weird things with limits.  E.g.  fetching them
twice may return 2 different values after writable limits are implemented.

I.e.  either use rlimit helpers added in commit 3e10e716ab ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:29 -08:00
KAMEZAWA Hiroyuki
b084d4353f mm: count swap usage
A frequent questions from users about memory management is what numbers of
swap ents are user for processes.  And this information will give some
hints to oom-killer.

Besides we can count the number of swapents per a process by scanning
/proc/<pid>/smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'.  (ps or top is now
enough slow..)

This patch adds a counter of swapents to mm_counter and update is at each
swap events.  Information is exported via /proc/<pid>/status file as

[kamezawa@bluextal memory]$ cat /proc/self/status
Name:   cat
State:  R (running)
Tgid:   2910
Pid:    2910
PPid:   2823
TracerPid:      0
Uid:    500     500     500     500
Gid:    500     500     500     500
FDSize: 256
Groups: 500
VmPeak:    82696 kB
VmSize:    82696 kB
VmLck:         0 kB
VmHWM:       432 kB
VmRSS:       432 kB
VmData:      172 kB
VmStk:        84 kB
VmExe:        48 kB
VmLib:      1568 kB
VmPTE:        40 kB
VmSwap:        0 kB <=============== this.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:24 -08:00
KAMEZAWA Hiroyuki
d559db086f mm: clean up mm_counter
Presently, per-mm statistics counter is defined by macro in sched.h

This patch modifies it to
  - defined in mm.h as inlinf functions
  - use array instead of macro's name creation.

This patch is for reducing patch size in future patch to modify
implementation of per-mm counter.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:23 -08:00
Linus Torvalds
0f2cc4ecd8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  init: Open /dev/console from rootfs
  mqueue: fix typo "failues" -> "failures"
  mqueue: only set error codes if they are really necessary
  mqueue: simplify do_open() error handling
  mqueue: apply mathematics distributivity on mq_bytes calculation
  mqueue: remove unneeded info->messages initialization
  mqueue: fix mq_open() file descriptor leak on user-space processes
  fix race in d_splice_alias()
  set S_DEAD on unlink() and non-directory rename() victims
  vfs: add NOFOLLOW flag to umount(2)
  get rid of ->mnt_parent in tomoyo/realpath
  hppfs can use existing proc_mnt, no need for do_kern_mount() in there
  Mirror MS_KERNMOUNT in ->mnt_flags
  get rid of useless vfsmount_lock use in put_mnt_ns()
  Take vfsmount_lock to fs/internal.h
  get rid of insanity with namespace roots in tomoyo
  take check for new events in namespace (guts of mounts_poll()) to namespace.c
  Don't mess with generic_permission() under ->d_lock in hpfs
  sanitize const/signedness for udf
  nilfs: sanitize const/signedness in dealing with ->d_name.name
  ...

Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04 08:15:33 -08:00
Al Viro
9f5596af44 take check for new events in namespace (guts of mounts_poll()) to namespace.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:59 -05:00
Helight.Xu
587d4a17d8 some clean up in fs/proc
EXPORT_SYMBOL(proc_symlink);
EXPORT_SYMBOL(proc_mkdir);
EXPORT_SYMBOL(create_proc_entry);
EXPORT_SYMBOL(proc_create_data);
EXPORT_SYMBOL(remove_proc_entry);

Those EXPORT_SYMBOL shouldn't be in fs/proc/root.c,
should be in fs/proc/generic.c.

Signed-off-by: Helight.Xu <helight.xu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:18 -05:00
James Morris
b4ccebdd37 Merge branch 'next' into for-linus 2010-03-01 09:36:31 +11:00
Linus Torvalds
642c4c75a7 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits)
  rcu: Fix accelerated GPs for last non-dynticked CPU
  rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot
  rcu: Fix accelerated grace periods for last non-dynticked CPU
  rcu: Export rcu_scheduler_active
  rcu: Make rcu_read_lock_sched_held() take boot time into account
  rcu: Make lockdep_rcu_dereference() message less alarmist
  sched, cgroups: Fix module export
  rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information
  rcu: Fix rcutorture mod_timer argument to delay one jiffy
  rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection
  rcu: Convert to raw_spinlocks
  rcu: Stop overflowing signed integers
  rcu: Use canonical URL for Mathieu's dissertation
  rcu: Accelerate grace period if last non-dynticked CPU
  rcu: Fix citation of Mathieu's dissertation
  rcu: Documentation update for CONFIG_PROVE_RCU
  security: Apply lockdep-based checking to rcu_dereference() uses
  idr: Apply lockdep-based diagnostics to rcu_dereference() uses
  radix-tree: Disable RCU lockdep checking in radix tree
  vfs: Abstract rcu_dereference_check for files-fdtable use
  ...
2010-02-28 10:13:16 -08:00
Linus Torvalds
6ebdc661b6 Merge branch 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (41 commits)
  of: remove undefined request_OF_resource & release_OF_resource
  of/sparc: Remove sparc-local declaration of allnodes and devtree_lock
  of: move definition of of_chosen into common code.
  of: remove unused extern reference to devtree_lock
  of: put default string compare and #a/s-cell values into common header
  of/flattree: Don't assume HAVE_LMB
  of: protect linux/of.h with CONFIG_OF
  proc_devtree: fix THIS_MODULE without module.h
  of: Remove old and misplaced function declarations
  of/flattree: Make the kernel accept ePAPR style phandle information
  of/flattree: endian-convert members of boot_param_header
  of: assume big-endian properties, adding conversions where necessary
  of: use __be32 for cell value accessors
  of/flattree: use OF_ROOT_NODE_{SIZE,ADDR}_CELLS DEFAULT for fdt parsing
  of/flattree: use callback to setup initrd from /chosen
  proc_devtree: include linux/of.h
  of: make set_node_proc_entry private to proc_devtree.c
  of: include linux/proc_fs.h
  of/flattree: merge early_init_dt_scan_memory() common code
  of: add 'of_' prefix to machine_is_compatible()
  ...
2010-02-25 15:38:37 -08:00
Paul E. McKenney
7dc5215798 vfs: Apply lockdep-based checking to rcu_dereference() uses
Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
and fcheck_files().

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:48 +01:00
Al Viro
7fee4868be Switch proc/self to nd_set_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-19 10:25:41 -05:00
Jeremy Kerr
7c540d9e3d proc_devtree: fix THIS_MODULE without module.h
Commit e22f628395 introduced a build
breakage for ARM devtree work: the THIS_MODULE macro was added, but we
don't have module.h

This change adds the necessary #include to get THIS_MODULE defined.
While we could just replace it with NULL (PROC_FS is a bool, not a
tristate), using THIS_MODULE will prevent unexpected breakage if we
ever do compile this as a module.

Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Michal Simek <monstr@monstr.eu>
2010-02-14 07:13:41 -07:00
Jeremy Kerr
50ab2fe147 proc_devtree: include linux/of.h
Currenly, proc_devtree.c depends on asm/prom.h to include linux/of.h, to
provide some device-tree definitions (eg, struct property).

Instead, include linux/of.h directly. We still need asm/prom.h for
HAVE_ARCH_DEVTREE_FIXUPS.

Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-02-09 08:34:10 -07:00
Jeremy Kerr
8cfb3343f7 of: make set_node_proc_entry private to proc_devtree.c
We only need set_node_proc_entry in proc_devtree.c, so move it there.

This fixes the !HAVE_ARCH_DEVTREE_FIXUPS build, as we can't make make
the definition in linux/of.h conditional on this #define (definitions in
asm/prom.h can't be exposed to linux/of.h, due to the enforced #include
ordering).

Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-02-09 08:34:10 -07:00
Kees Cook
d78ca3cd73 syslog: use defined constants instead of raw numbers
Right now the syslog "type" action are just raw numbers which makes
the source difficult to follow.  This patch replaces the raw numbers
with defined constants for some level of sanity.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-02-04 14:20:41 +11:00
Kees Cook
002345925e syslog: distinguish between /proc/kmsg and syscalls
This allows the LSM to distinguish between syslog functions originating
from /proc/kmsg access and direct syscalls.  By default, the commoncaps
will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg
file descriptor.  For example the kernel syslog reader can now drop
privileges after opening /proc/kmsg, instead of staying privileged with
CAP_SYS_ADMIN.  MAC systems that implement security_syslog have unchanged
behavior.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-02-04 14:20:12 +11:00
Al Viro
86acdca1b6 fix autofs/afs/etc. magic mountpoint breakage
We end up trying to kfree() nd.last.name on open("/mnt/tmp", O_CREAT)
if /mnt/tmp is an autofs direct mount.  The reason is that nd.last_type
is bogus here; we want LAST_BIND for everything of that kind and we
get LAST_NORM left over from finding parent directory.

So make sure that it *is* set properly; set to LAST_BIND before
doing ->follow_link() - for normal symlinks it will be changed
by __vfs_follow_link() and everything else needs it set that way.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14 09:05:25 -05:00
Minchan Kim
7f53a09ed4 smaps: fix wrong rss count
A long time ago we regarded zero page as file_rss and vm_normal_page
doesn't return NULL.

But now, we reinstated ZERO_PAGE and vm_normal_page's implementation can
return NULL in case of zero page.  Also we don't count it with file_rss
any more.

Then, RSS and PSS can't be matched.  For consistency, Let's ignore zero
page in smaps_pte_range.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:07 -08:00
KOSAKI Motohiro
1306d603fc proc: partially revert "procfs: provide stack information for threads"
Commit d899bf7b (procfs: provide stack information for threads) introduced
to show stack information in /proc/{pid}/status.  But it cause large
performance regression.  Unfortunately /proc/{pid}/status is used ps
command too and ps is one of most important component.  Because both to
take mmap_sem and page table walk are heavily operation.

If many process run, the ps performance is,

[before d899bf7b]

% perf stat ps >/dev/null

 Performance counter stats for 'ps':

     4090.435806  task-clock-msecs         #      0.032 CPUs
             229  context-switches         #      0.000 M/sec
               0  CPU-migrations           #      0.000 M/sec
             234  page-faults              #      0.000 M/sec
      8587565207  cycles                   #   2099.425 M/sec
      9866662403  instructions             #      1.149 IPC
      3789415411  cache-references         #    926.409 M/sec
        30419509  cache-misses             #      7.437 M/sec

   128.859521955  seconds time elapsed

[after d899bf7b]

% perf stat  ps  > /dev/null

 Performance counter stats for 'ps':

     4305.081146  task-clock-msecs         #      0.028 CPUs
             480  context-switches         #      0.000 M/sec
               2  CPU-migrations           #      0.000 M/sec
             237  page-faults              #      0.000 M/sec
      9021211334  cycles                   #   2095.480 M/sec
     10605887536  instructions             #      1.176 IPC
      3612650999  cache-references         #    839.160 M/sec
        23917502  cache-misses             #      5.556 M/sec

   152.277819582  seconds time elapsed

Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
provide almost same information. we can use it.

Commit d899bf7b introduced two features:

 1) Add the annotattion of [thread stack: xxxx] mark to
    /proc/{pid}/task/{tid}/maps.
 2) Add StackUsage field to /proc/{pid}/status.

I only revert (2), because I haven't seen (1) cause regression.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:06 -08:00
Linus Torvalds
aac3d39693 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
  sched: Fix broken assertion
  sched: Assert task state bits at build time
  sched: Update task_state_arraypwith new states
  sched: Add missing state chars to TASK_STATE_TO_CHAR_STR
  sched: Move TASK_STATE_TO_CHAR_STR near the TASK_state bits
  sched: Teach might_sleep() about preemptible RCU
  sched: Make warning less noisy
  sched: Simplify set_task_cpu()
  sched: Remove the cfs_rq dependency from set_task_cpu()
  sched: Add pre and post wakeup hooks
  sched: Move kthread_bind() back to kthread.c
  sched: Fix select_task_rq() vs hotplug issues
  sched: Fix sched_exec() balancing
  sched: Ensure set_task_cpu() is never called on blocked tasks
  sched: Use TASK_WAKING for fork wakups
  sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE
  sched: Fix task_hot() test order
  sched: Fix set_cpu_active() in cpu_down()
  sched: Mark boot-cpu active before smp_init()
  sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
  ...
2009-12-19 09:47:49 -08:00
Peter Zijlstra
e1781538cf sched: Assert task state bits at build time
Since everybody is lazy and prone to forgetting things, make the
compiler help us a bit.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20091217121830.060186433@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-17 13:22:45 +01:00
Peter Zijlstra
464763cf1c sched: Update task_state_arraypwith new states
Neglected because its hidden... (who reads comments anyway)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20091217121829.970166036@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-17 13:22:45 +01:00
Linus Torvalds
d4220f987c Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
  HWPOISON: Remove stray phrase in a comment
  HWPOISON: Try to allocate migration page on the same node
  HWPOISON: Don't do early filtering if filter is disabled
  HWPOISON: Add a madvise() injector for soft page offlining
  HWPOISON: Add soft page offline support
  HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
  HWPOISON: Use new shake_page in memory_failure
  HWPOISON: Use correct name for MADV_HWPOISON in documentation
  HWPOISON: mention HWPoison in Kconfig entry
  HWPOISON: Use get_user_page_fast in hwpoison madvise
  HWPOISON: add an interface to switch off/on all the page filters
  HWPOISON: add memory cgroup filter
  memcg: add accessor to mem_cgroup.css
  memcg: rename and export try_get_mem_cgroup_from_page()
  HWPOISON: add page flags filter
  mm: export stable page flags
  HWPOISON: limit hwpoison injector to known page types
  HWPOISON: add fs/device filters
  HWPOISON: return 0 to indicate success reliably
  HWPOISON: make semantics of IGNORED/DELAYED clear
  ...
2009-12-16 12:36:49 -08:00
Christoph Hellwig
698ba7b5a3 elf: kill USE_ELF_CORE_DUMP
Currently all architectures but microblaze unconditionally define
USE_ELF_CORE_DUMP.  The microblaze omission seems like an error to me, so
let's kill this ifdef and make sure we are the same everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Simek <michal.simek@petalogix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:12 -08:00
Alexey Dobriyan
135d5655dc proc: rename de_get() to pde_get() and inline it
* de_get() is trivial -- make inline, save a few bits of code, drop
  "refcount is 0" check -- it should be done in some generic refcount
  code, don't recall it's was helpful

* rename GET and PUT functions to pde_get(), pde_put() for cool prefix!

* remove obvious and incorrent comments

* in remove_proc_entry() use pde_put(), when I fixed PDE refcounting to
  be normal one, remove_proc_entry() was supposed to do "-1" and code now
  reflects that.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:19:57 -08:00
Wu Fengguang
1a9b5b7fe0 mm: export stable page flags
Rename get_uflags() to stable_page_flags() and make it a global function
for use in the hwpoison page flags filter, which need to compare user
page flags with the value provided by user space.

Also move KPF_* to kernel-page-flags.h for use by user space tools.

Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
CC: Nick Piggin <npiggin@suse.de>
CC: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-12-16 12:19:59 +01:00
john stultz
4614a696bd procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm
Setting a thread's comm to be something unique is a very useful ability
and is helpful for debugging complicated threaded applications.  However
currently the only way to set a thread name is for the thread to name
itself via the PR_SET_NAME prctl.

However, there may be situations where it would be advantageous for a
thread dispatcher to be naming the threads its managing, rather then
having the threads self-describe themselves.  This sort of behavior is
available on other systems via the pthread_setname_np() interface.

This patch exports a task's comm via proc/pid/comm and
proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
these values.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Mike Fulton <fultonm@ca.ibm.com>
Cc: Sean Foley <Sean_Foley@ca.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:24 -08:00
Steven J. Magnani
7e1e0ef22c procfs: use proper units for noMMU statm
On no-MMU systems, sizes reported in /proc/n/statm have units of bytes.
Per Documentation/filesystems/proc.txt, these values should be in pages.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:24 -08:00
Naoya Horiguchi
5dc37642cb mm hugetlb: add hugepage support to pagemap
This patch enables extraction of the pfn of a hugepage from
/proc/pid/pagemap in an architecture independent manner.

Details
-------
My test program (leak_pagemap) works as follows:
 - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
 - read()/write() something on it,
 - call page-types with option -p,
 - munmap() and unlink() the file on hugetlbfs

Without my patches
------------------
$ ./leak_pagemap
             flags page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000000          1        0  __________________________________
0x0000000000000804          1        0  __R________M______________________ referenced,mmap
0x000000000000086c         81        0  __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
0x0000000000005808          5        0  ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
0x0000000000005868         12        0  ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c          1        0  __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total        101        0

The output of page-types don't show any hugepage.

With my patches
---------------
$ ./leak_pagemap
             flags page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000000          1        0  __________________________________
0x0000000000030000      51100      199  ________________TG________________ compound_tail,huge
0x0000000000028018        100        0  ___UD__________H_G________________ uptodate,dirty,compound_head,huge
0x0000000000000804          1        0  __R________M______________________ referenced,mmap
0x000000000000080c          1        0  __RU_______M______________________ referenced,uptodate,mmap
0x000000000000086c         80        0  __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
0x0000000000005808          4        0  ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
0x0000000000005868         12        0  ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c          1        0  __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total      51300      200

The output of page-types shows 51200 pages contributing to hugepages,
containing 100 head pages and 51100 tail pages as expected.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:24 -08:00
Benjamin Herrenschmidt
bcd6acd51f Merge commit 'origin/master' into next
Conflicts:
	include/linux/kvm.h
2009-12-09 17:14:38 +11:00
Linus Torvalds
1557d33007 Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
  security/tomoyo: Remove now unnecessary handling of security_sysctl.
  security/tomoyo: Add a special case to handle accesses through the internal proc mount.
  sysctl: Drop & in front of every proc_handler.
  sysctl: Remove CTL_NONE and CTL_UNNUMBERED
  sysctl: kill dead ctl_handler definitions.
  sysctl: Remove the last of the generic binary sysctl support
  sysctl net: Remove unused binary sysctl code
  sysctl security/tomoyo: Don't look at ctl_name
  sysctl arm: Remove binary sysctl support
  sysctl x86: Remove dead binary sysctl support
  sysctl sh: Remove dead binary sysctl support
  sysctl powerpc: Remove dead binary sysctl support
  sysctl ia64: Remove dead binary sysctl support
  sysctl s390: Remove dead sysctl binary support
  sysctl frv: Remove dead binary sysctl support
  sysctl mips/lasat: Remove dead binary sysctl support
  sysctl drivers: Remove dead binary sysctl support
  sysctl crypto: Remove dead binary sysctl support
  sysctl security/keys: Remove dead binary sysctl support
  sysctl kernel: Remove binary sysctl logic
  ...
2009-12-08 07:38:50 -08:00
Linus Torvalds
897e81bea1 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
  sched, cputime: Introduce thread_group_times()
  sched, cputime: Cleanups related to task_times()
  Revert "sched, x86: Optimize branch hint in __switch_to()"
  sched: Fix isolcpus boot option
  sched: Revert 498657a478
  sched, time: Define nsecs_to_jiffies()
  sched: Remove task_{u,s,g}time()
  sched: Introduce task_times() to replace task_{u,s}time() pair
  sched: Limit the number of scheduler debug messages
  sched.c: Call debug_show_all_locks() when dumping all tasks
  sched, x86: Optimize branch hint in __switch_to()
  sched: Optimize branch hint in context_switch()
  sched: Optimize branch hint in pick_next_task_fair()
  sched_feat_write(): Update ppos instead of file->f_pos
  sched: Sched_rt_periodic_timer vs cpu hotplug
  sched, kvm: Fix race condition involving sched_in_preempt_notifers
  sched: More generic WAKE_AFFINE vs select_idle_sibling()
  sched: Cleanup select_task_rq_fair()
  sched: Fix granularity of task_u/stime()
  sched: Fix/add missing update_rq_clock() calls
  ...
2009-12-05 15:30:49 -08:00
Hidetoshi Seto
0cf55e1ec0 sched, cputime: Introduce thread_group_times()
This is a real fix for problem of utime/stime values decreasing
described in the thread:

   http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

 - {u,s}time in task_struct are increased every time when the thread
   is interrupted by a tick (timer interrupt).

 - When a thread exits, its {u,s}time are added to signal->{u,s}time,
   after adjusted by task_times().

 - When all threads in a thread_group exits, accumulated {u,s}time
   (and also c{u,s}time) in signal struct are added to c{u,s}time
   in signal struct of the group's parent.

So {u,s}time in task struct are "raw" tick count, while
{u,s}time and c{u,s}time in signal struct are "adjusted" values.

And accounted values are used by:

 - task_times(), to get cputime of a thread:
   This function returns adjusted values that originates from raw
   {u,s}time and scaled by sum_exec_runtime that accounted by CFS.

 - thread_group_cputime(), to get cputime of a thread group:
   This function returns sum of all {u,s}time of living threads in
   the group, plus {u,s}time in the signal struct that is sum of
   adjusted cputimes of all exited threads belonged to the group.

The problem is the return value of thread_group_cputime(),
because it is mixed sum of "raw" value and "adjusted" value:

  group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater
than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
but only runs 45ms) and if it exits, cputime will decrease (e.g.
-5ms).

To fix this, we could do:

  group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it for
every thread should be avoided.

This patch fixes the above problem in the following way:

 - Modify thread's exit (= __exit_signal()) not to use task_times().
   It means {u,s}time in signal struct accumulates raw values instead
   of adjusted values.  As the result it makes thread_group_cputime()
   to return pure sum of "raw" values.

 - Introduce a new function thread_group_times(*task, *utime, *stime)
   that converts "raw" values of thread_group_cputime() to "adjusted"
   values, in same calculation procedure as task_times().

 - Modify group's exit (= wait_task_zombie()) to use this introduced
   thread_group_times().  It make c{u,s}time in signal struct to
   have adjusted values like before this patch.

 - Replace some thread_group_cputime() by thread_group_times().
   This replacements are only applied where conveys the "adjusted"
   cputime to users, and where already uses task_times() near by it.
   (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)

This patch have a positive side effect:

 - Before this patch, if a group contains many short-life threads
   (e.g. runs 0.9ms and not interrupted by ticks), the group's
   cputime could be invisible since thread's cputime was accumulated
   after adjusted: imagine adjustment function as adj(ticks, runtime),
     {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
   After this patch it will not happen because the adjustment is
   applied after accumulated.

v2:
 - remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-02 17:32:40 +01:00
Hidetoshi Seto
d5b7c78e97 sched: Remove task_{u,s,g}time()
Now all task_{u,s}time() pairs are replaced by task_times().
And task_gtime() is too simple to be an inline function.

Cleanup them all.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
LKML-Reference: <4B0E16D1.70902@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-26 12:59:20 +01:00
Hidetoshi Seto
d180c5bcce sched: Introduce task_times() to replace task_{u,s}time() pair
Functions task_{u,s}time() are called in pair in almost all
cases.  However task_stime() is implemented to call task_utime()
from its inside, so such paired calls run task_utime() twice.

It means we do heavy divisions (div_u64 + do_div) twice to get
utime and stime which can be obtained at same time by one set
of divisions.

This patch introduces a function task_times(*tsk, *utime,
*stime) to retrieve utime and stime at once in better, optimized
way.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
LKML-Reference: <4B0E16AE.906@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-26 12:59:19 +01:00
Ingo Molnar
16bc67edeb Merge branch 'sched/urgent' into sched/core
Merge reason: Pick up fixes that did not make it into .32.0

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-26 10:50:42 +01:00
Benjamin Herrenschmidt
1e43bee9c7 Merge commit 'origin/master' into next 2009-11-24 17:16:30 +11:00
Stefani Seibold
9ebd4eba76 procfs: fix /proc/<pid>/stat stack pointer for kernel threads
Fix a small issue for the stack pointer in /proc/<pid>/stat.  In case of a
kernel thread the value of the printed stack pointer should be 0.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-17 17:40:33 -08:00
Eric W. Biederman
bb9074ff58 Merge commit 'v2.6.32-rc7'
Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler
gets a small bug fix and the sysctl tree where I am removing all
sysctl strategy routines.
2009-11-17 01:01:34 -08:00
Sukadev Bhattiprolu
29f12ca321 pidns: fix a leak in /proc dentries and inodes with pid namespaces.
Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace'
that is discussed in:

	http://lkml.org/lkml/2009/10/2/159.

To summarize the thread, when container-init is terminated, it sets the
PF_EXITING flag, zaps other processes in the container and waits to reap
them.  As a part of reaping, the container-init should flush any /proc
dentries associated with the processes.  But because the container-init is
itself exiting and the following PF_EXITING check, the dentries are not
flushed, resulting in leak in /proc inodes and dentries.

This fix reverts the commit 7766755a2f ("Fix /proc dcache deadlock
in do_exit") which introduced the check for PF_EXITING.  At the time of
the commit, shrink_dcache_parent() flushed dentries from other filesystems
also and could have caused a deadlock which the commit fixed.  But as
pointed out by Eric Biederman, after commit 0feae5c47a,
shrink_dcache_parent() no longer affects other filesystems.  So reverting
the commit is now safe.

As pointed out by Jan Kara, the leak is not as critical since the
unclaimed space will be reclaimed under memory pressure or by:

	echo 3 > /proc/sys/vm/drop_caches

But since this check is no longer required, its best to remove it.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Jan Kara <jack@ucw.cz>
Cc: Andrea Arcangeli <andrea@cpushare.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-12 07:25:57 -08:00
Eric W. Biederman
2315ffa0a9 sysctl: Don't look at ctl_name and strategy in the generic code
The ctl_name and strategy fields are unused, now that sys_sysctl
is a compatibility wrapper around /proc/sys.  No longer looking
at them in the generic code is effectively what we are doing
now and provides the guarantee that during further cleanups
we can just remove references to those fields and everything
will work ok.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2009-11-11 00:53:43 -08:00
Alexey Dobriyan
e22f628395 Convert /proc/device-tree/ to seq_file
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-30 17:20:51 +11:00
Hugh Dickins
370c28def6 hwpoison: fix/proc/meminfo alignment
Given such a long name, the kB count in /proc/meminfo's HardwareCorrupted
line is being shown too far right (it does align with x86_64's VmallocChunk
above, but I hope nobody will ever have that much corrupted!).  Align it.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:25 -07:00
Ryota Ozaki
ce0e7b28fb sched, cpuacct: Fix niced guest time accounting
CPU time of a guest is always accounted in 'user' time
without concern for the nice value of its counterpart
process although the guest is scheduled under the nice
value.

This patch fixes the defect and accounts cpu time of
a niced guest in 'nice' time as same as a niced process.

And also the patch adds 'guest_nice' to cpuacct. The
value provides niced guest cpu time which is like 'nice'
to 'user'.

The original discussions can be found here:

  http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.html
  http://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html

Signed-off-by: Ryota Ozaki <ozaki.ryota@gmail.com>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1256314810-7897-1-git-send-email-ozaki.ryota@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-25 17:31:30 +01:00
Ingo Molnar
0b9e31e926 Merge branch 'linus' into sched/core
Conflicts:
	fs/proc/array.c

Merge reason: resolve conflict and queue up dependent patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-25 17:30:53 +01:00
Wu Fengguang
253fb02d62 pagemap: export KPF_HWPOISON
This flag indicates a hardware detected memory corruption on the page.
Any future access of the page data may bring down the machine.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-08 07:36:39 -07:00
Jaswinder Singh Rajput
4055e97318 fs: includecheck fix: proc, kcore.c
fix the following 'make includecheck' warning:

  fs/proc/kcore.c: linux/mm.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-08 07:36:38 -07:00
Andrew Morton
c44972f178 procfs: disable per-task stack usage on NOMMU
It needs walk_page_range().

Reported-by: Michal Simek <monstr@monstr.eu>
Tested-by: Michal Simek <monstr@monstr.eu>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 17:11:24 -07:00
Linus Torvalds
7ca263cdf8 Merge branch 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6
* 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6:
  [PATCH] Fix idle time field in /proc/uptime
2009-09-24 09:04:24 -07:00
Linus Torvalds
db16826367 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
  HWPOISON: Enable error_remove_page on btrfs
  HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
  HWPOISON: Add madvise() based injector for hardware poisoned pages v4
  HWPOISON: Enable error_remove_page for NFS
  HWPOISON: Enable .remove_error_page for migration aware file systems
  HWPOISON: The high level memory error handler in the VM v7
  HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
  HWPOISON: shmem: call set_page_dirty() with locked page
  HWPOISON: Define a new error_remove_page address space op for async truncation
  HWPOISON: Add invalidate_inode_page
  HWPOISON: Refactor truncate to allow direct truncating of page v2
  HWPOISON: check and isolate corrupted free pages v2
  HWPOISON: Handle hardware poisoned pages in try_to_unmap
  HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
  HWPOISON: Add poison check to page fault handling
  HWPOISON: Add basic support for poisoned pages in fault handler v3
  HWPOISON: Add new SIGBUS error codes for hardware poison signals
  HWPOISON: Add support for poison swap entries v2
  HWPOISON: Export some rmap vma locking to outside world
  ...
2009-09-24 07:53:22 -07:00
Alexey Dobriyan
8d65af789f sysctl: remove "struct file *" argument of ->proc_handler
It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:21:04 -07:00
Michael Abbott
96830a57de [PATCH] Fix idle time field in /proc/uptime
Git commit 79741dd changes idle cputime accounting, but unfortunately
the /proc/uptime file hasn't caught up.  Here the idle time calculation
from /proc/stat is copied over.

Signed-off-by: Michael Abbott <michael.abbott@diamond.ac.uk>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-09-24 10:16:24 +02:00
KAMEZAWA Hiroyuki
0d4c36a9b6 /proc/kcore: update stat.st_size after memory hotplug
After memory hotplug (or other events in future), kcore size can be
modified.

To update inode->i_size, we have to know inode/dentry but we can't get it
from inside /proc directly.  But considerinyg memory hotplug, kcore image
is updated only when it's opened.  Then, updating inode->i_size at open()
is enough.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki
678ad5d8aa /proc/kcore: fix stat.st_size
Presently the size of /proc/kcore which can be read by 'ls -l' is 0.  But
it's not the correct value.

On x86-64, ls -l shows
 ... root root 140737486266368 2009-09-17 10:29 /proc/kcore
Then, 7FFFFFFE02000. This comes from vmalloc area's size.
(*) This shows "core" size, not  memory size.

This patch shows the size by updating "size" field in struct
proc_dir_entry.  Later, lookup routine will create inode and fill
inode->i_size based on this value.  Then, this has a problem.

 - Once inode is cached, inode->i_size will never be updated.

Then, this patch is not memory-hotplug-aware.

To update inode->i_size, we have to know dentry or inode.
But there is no way to lookup them by inside kernel. Hmmm....
Next patch will try it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki
90396f96b7 kcore: more fixes for init
proc_kcore_init() doesn't check NULL case.  fix it and remove unnecessary
comments.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki
81ac3ad906 kcore: register module area in generic way
Some archs define MODULED_VADDR/MODULES_END which is not in VMALLOC area.
This is handled only in x86-64.  This patch make it more generic.  And we
can use vread/vwrite to access the area.  Fix it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki
26562c59fa kcore: register vmemmap range
Benjamin Herrenschmidt <benh@kernel.crashing.org> pointed out that vmemmap
range is not included in KCORE_RAM, KCORE_VMALLOC ....

This adds KCORE_VMEMMAP if SPARSEMEM_VMEMMAP is used.  By this, vmemmap
can be readable via /proc/kcore

Because it's not vmalloc area, vread/vwrite cannot be used.  But the range
is static against the memory layout, this patch handles vmemmap area by
the same scheme with physical memory.

This patch assumes SPARSEMEM_VMEMMAP range is not in VMALLOC range.  It's
correct now.

[akpm@linux-foundation.org: fix typo]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki
3089aa1b0c kcore: use registerd physmem information
For /proc/kcore, each arch registers its memory range by kclist_add().
In usual,

	- range of physical memory
	- range of vmalloc area
	- text, etc...

are registered but "range of physical memory" has some troubles.  It
doesn't updated at memory hotplug and it tend to include unnecessary
memory holes.  Now, /proc/iomem (kernel/resource.c) includes required
physical memory range information and it's properly updated at memory
hotplug.  Then, it's good to avoid using its own code(duplicating
information) and to rebuild kclist for physical memory based on
/proc/iomem.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki
9492587cf3 kcore: register text area in generic way
Some 64bit arch has special segment for mapping kernel text.  It should be
entried to /proc/kcore in addtion to direct-linear-map, vmalloc area.
This patch unifies KCORE_TEXT entry scattered under x86 and ia64.

I'm not familiar with other archs (mips has its own even after this patch)
but range of [_stext ..._end) is a valid area of text and it's not in
direct-map area, defining CONFIG_ARCH_PROC_KCORE_TEXT is only a necessary
thing to do.

Note: I left mips as it is now.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki
a0614da88b kcore: register vmalloc area in generic way
For /proc/kcore, vmalloc areas are registered per arch.  But, all of them
registers same range of [VMALLOC_START...VMALLOC_END) This patch unifies
them.  By this.  archs which have no kclist_add() hooks can see vmalloc
area correctly.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki
c30bb2a25f kcore: add kclist types
Presently, kclist_add() only eats start address and size as its arguments.
Considering to make kclist dynamically reconfigulable, it's necessary to
know which kclists are for System RAM and which are not.

This patch add kclist types as
  KCORE_RAM
  KCORE_VMALLOC
  KCORE_TEXT
  KCORE_OTHER

This "type" is used in a patch following this for detecting KCORE_RAM.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki
2ef43ec772 kcore: use usual list for kclist
This patchset is for /proc/kcore.  With this,

 - many per-arch hooks are removed.

 - /proc/kcore will know really valid physical memory area.

 - /proc/kcore will be aware of memory hotplug.

 - /proc/kcore will be architecture independent i.e.
   if an arch supports CONFIG_MMU, it can use /proc/kcore.
   (if the arch uses usual memory layout.)

This patch:

/proc/kcore uses its own list handling codes. It's better to use
generic list codes.

No changes in logic. just clean up.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:41 -07:00
Stefani Seibold
d899bf7b55 procfs: provide stack information for threads
A patch to give a better overview of the userland application stack usage,
especially for embedded linux.

Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value.  But you get no
information about the consumed stack memory of the the threads.

There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]".  xxxxxxxx is the maximum size of stack.  This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.

A sample output of /proc/<pid>/task/<tid>/maps looks like:

08048000-08049000 r-xp 00000000 03:00 8312       /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312       /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0          [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222       /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222       /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222       /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0          [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]

Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
which will you give the current stack usage in kb.

A sample output of /proc/self/status looks like:

Name:	cat
State:	R (running)
Tgid:	507
Pid:	507
.
.
.
CapBnd:	fffffffffffffeff
voluntary_ctxt_switches:	0
nonvoluntary_ctxt_switches:	0
Stack usage:	12 kB

I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process.  This makes more sense.

[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:41 -07:00
Vincent Li
cba8aafe1e fs/proc/base.c: fix proc_fault_inject_write() input sanity check
Remove obfuscated zero-length input check and return -EINVAL instead of
-EIO error to make the error message clear to user.  Add whitespace
stripping.  No functionality changes.

The old code:

echo  1  > /proc/pid/make-it-fail (ok)
echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error)

The new code:

echo  1  > /proc/pid/make-it-fail (ok)
echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument)

This patch is conservative in changes to not breaking existing
scripts/applications.

Signed-off-by: Vincent Li <macli@brc.ubc.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:40 -07:00
Vincent Li
fb92a4b068 fs/proc/task_mmu.c v1: fix clear_refs_write() input sanity check
Andrew Morton pointed out similar string hacking and obfuscated check for
zero-length input at the end of the function, David Rientjes suggested to
use strict_strtol to replace simple_strtol, this patch cover above
suggestions, add removing of leading and trailing whitespace from user
input.  It does not change function behavious.

Signed-off-by: Vincent Li <macli@brc.ubc.ca>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Amerigo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:40 -07:00
Amerigo Wang
acef82b873 kcore: fix /proc/kcore's stat.st_size
In 9063c61fd5 ("x86, 64-bit: Clean up user address masking") Linus
fixed the wrong size of /proc/kcore problem.

But its size still looks insane, since it never equals the size of
physical memory.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tao Ma <tao.ma@oracle.com>
Cc: <mtk.manpages@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:40 -07:00
Oleg Nesterov
9b4d1cbef8 proc_flush_task: flush /proc/tid/task/pid when a sub-thread exits
The exiting sub-thread flushes /proc/pid only, but this doesn't buy too
much: ps and friends mostly use /proc/tid/task/pid.

Remove "if (thread_group_leader())" checks from proc_flush_task() path,
this means we always remove /proc/tid/task/pid dentry on exit, and this
actually matches the comment above proc_flush_task().

The test-case:

	static void* tfunc(void *arg)
	{
		char name[256];

		sprintf(name, "/proc/%d/task/%ld/status", getpid(), gettid());
		close(open(name, O_RDONLY));

		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		for (;;) {
			if (!pthread_create(&t, NULL, &tfunc, NULL))
				pthread_join(t, NULL);
		}
	}

slabtop shows that pid/proc_inode_cache/etc grow quickly and
"indefinitely" until the task is killed or shrink_slab() is called, not
good.  And the main thread needs a lot of time to exit.

The same can happen if something like "ps -efL" runs continuously, while
some application spawns short-living threads.

Reported-by: "James M. Leddy" <jleddy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dominic Duval <dduval@redhat.com>
Cc: Frank Hirtz <fhirtz@redhat.com>
Cc: "Fuller, Johnray" <Johnray.Fuller@gs.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Paul Batkowski <pbatkowski@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:40 -07:00
Kees Cook
cff4edb591 proc: fix reported unit for RLIMIT_CPU
/proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit
used in kernel/posix-cpu-timers.c:

        unsigned long psecs = cputime_to_secs(ptime);
        ...
        if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) {
                ...
                __group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:40 -07:00
James Morris
88e9d34c72 seq_file: constify seq_operations
Make all seq_operations structs const, to help mitigate against
revectoring user-triggerable function pointers.

This is derived from the grsecurity patch, although generated from scratch
because it's simpler than extracting the changes from there.

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:29 -07:00
KOSAKI Motohiro
5d863b8968 oom: fix oom_adjust_write() input sanity check
Andrew Morton pointed out oom_adjust_write() has very strange EIO
and new line handling. this patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:39 -07:00
KOSAKI Motohiro
495789a51a oom: make oom_score to per-process value
oom-killer kills a process, not task.  Then oom_score should be calculated
as per-process too.  it makes consistency more and makes speed up
select_bad_process().

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:39 -07:00
KOSAKI Motohiro
28b83c5193 oom: move oom_adj value from task_struct to signal_struct
Currently, OOM logic callflow is here.

    __out_of_memory()
        select_bad_process()            for each task
            badness()                   calculate badness of one task
                oom_kill_process()      search child
                    oom_kill_task()     kill target task and mm shared tasks with it

example, process-A have two thread, thread-A and thread-B and it have very
fat memory and each thread have following oom_adj and oom_score.

     thread-A: oom_adj = OOM_DISABLE, oom_score = 0
     thread-B: oom_adj = 0,           oom_score = very-high

Then, select_bad_process() select thread-B, but oom_kill_task() refuse
kill the task because thread-A have OOM_DISABLE.  Thus __out_of_memory()
call select_bad_process() again.  but select_bad_process() select the same
task.  It mean kernel fall in livelock.

The fact is, select_bad_process() must select killable task.  otherwise
OOM logic go into livelock.

And root cause is, oom_adj shouldn't be per-thread value.  it should be
per-process value because OOM-killer kill a process, not thread.  Thus
This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct signal_struct.  it naturally prevent
select_bad_process() choose wrong task.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:39 -07:00
KAMEZAWA Hiroyuki
73d7c33e81 kcore: /proc/kcore should use vread
/proc/kcore has its own routine to access vmallc area.  It can be replaced
with vread().  And by this, /proc/kcore can do safe access to vmalloc
area.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Mike Smith <scgtrp@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:34 -07:00
Moussa A. Ba
398499d5f3 pagemap clear_refs: modify to specify anon or mapped vma clearing
The patch makes the clear_refs more versatile in adding the option to
select anonymous pages or file backed pages for clearing.  This addition
has a measurable impact on user space application performance as it
decreases the number of pagewalks in scenarios where one is only
interested in a specific type of page (anonymous or file mapped).

The patch adds anonymous and file backed filters to the clear_refs interface.

echo 1 > /proc/PID/clear_refs resets the bits on all pages
echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only

Any other value is ignored

Signed-off-by: Moussa A. Ba <moussa.a.ba@gmail.com>
Signed-off-by: Jared E. Hulbert <jaredeh@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:33 -07:00
Hugh Dickins
9a84089514 ksm: identify PageKsm pages
KSM will need to identify its kernel merged pages unambiguously, and
/proc/kpageflags will probably like to do so too.

Since KSM will only be substituting anonymous pages, statistics are best
preserved by making a PageKsm page a special PageAnon page: one with no
anon_vma.

But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near
PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:31 -07:00
KOSAKI Motohiro
4b02108ac1 mm: oom analysis: add shmem vmstat
Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.

We often use the following calculation to determine the amount of shmem
pages:

shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

however the expression does not consider isolated and mlocked pages.

This patch adds explicit accounting for pages used by shmem and tmpfs.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:27 -07:00
KOSAKI Motohiro
c6a7f5728a mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log output
The amount of memory allocated to kernel stacks can become significant and
cause OOM conditions.  However, we do not display the amount of memory
consumed by stacks.

Add code to display the amount of memory used for stacks in /proc/meminfo.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:27 -07:00
Heiko Carstens
d01d482785 sched: Always show Cpus_allowed field in /proc/<pid>/status
The Cpus_allowed fields in /proc/<pid>/status is currently only
shown in case of CONFIG_CPUSETS. However their contents are also
useful for the !CONFIG_CPUSETS case.

So change the current behaviour and always show these fields.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090921090627.GD4649@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 11:37:27 +02:00
Andi Kleen
6a46079cf5 HWPOISON: The high level memory error handler in the VM v7
Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

High level machine check handler. Handles pages reported by the
hardware as being corrupted usually due to a 2bit ECC memory or cache
failure.

This focuses on pages detected as corrupted in the background.
When the current CPU tries to consume corruption the currently
running process can just be killed directly instead. This implies
that if the error cannot be handled for some reason it's safe to
just ignore it because no corruption has been consumed yet. Instead
when that happens another machine check will happen.

Handles page cache pages in various states. The tricky part
here is that we can access any page asynchronous to other VM
users, because memory failures could happen anytime and anywhere,
possibly violating some of their assumptions. This is why this code
has to be extremely careful. Generally it tries to use normal locking
rules, as in get the standard locks, even if that means the
error handling takes potentially a long time.

Some of the operations here are somewhat inefficient and have non
linear algorithmic complexity, because the data structures have not
been optimized for this case. This is in particular the case
for the mapping from a vma to a process. Since this case is expected
to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together and rmap knowledge has been seeping out anyways

Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
Nick Piggin (who did a lot of great work) and others.

Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
2009-09-16 11:50:15 +02:00
KOSAKI Motohiro
0753ba01e1 mm: revert "oom: move oom_adj value"
The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
the mm_struct.  It was a very good first step for sanitize OOM.

However Paul Menage reported the commit makes regression to his job
scheduler.  Current OOM logic can kill OOM_DISABLED process.

Why? His program has the code of similar to the following.

	...
	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
	...
	if (vfork() == 0) {
		set_oom_adj(0); /* Invoked child can be killed */
		execve("foo-bar-cmd");
	}
	....

vfork() parent and child are shared the same mm_struct.  then above
set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
change oom_adj for vfork() parent.  Then, vfork() parent (job scheduler)
lost OOM immune and it was killed.

Actually, fork-setting-exec idiom is very frequently used in userland program.
We must not break this assumption.

Then, this patch revert commit 2ff05b2b and related commit.

Reverted commit list
---------------------
- commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
- commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
- commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
- commit 933b787b57 (mm: copy over oom_adj value at fork time)

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-18 16:31:13 -07:00
Oleg Nesterov
704b836cbf mm_for_maps: take ->cred_guard_mutex to fix the race with exec
The problem is minor, but without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.

Now we do not need to re-check task->mm after ptrace_may_access(), it
can't be changed to the new mm under us.

Strictly speaking, this also fixes another very minor problem. Unless
security check fails or the task exits mm_for_maps() should never
return NULL, the caller should get either old or new ->mm.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-10 20:49:26 +10:00
Oleg Nesterov
00f89d2185 mm_for_maps: shift down_read(mmap_sem) to the caller
mm_for_maps() takes ->mmap_sem after security checks, this looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-10 20:48:32 +10:00
Oleg Nesterov
13f0feafa6 mm_for_maps: simplify, use ptrace_may_access()
It would be nice to kill __ptrace_may_access(). It requires task_lock(),
but this lock is only needed to read mm->flags in the middle.

Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
the code a little bit.

Also, we do not need to take ->mmap_sem in advance. In fact I think
mm_for_maps() should not play with ->mmap_sem at all, the caller should
take this lock.

With or without this patch, without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-10 20:47:42 +10:00
Cyrill Gorcunov
2f6d311080 proc: vmcore - use kzalloc in get_new_element()
Instead of kmalloc+memset better use straight kzalloc

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:41 -07:00
Michal Simek
bcac2b1b7d procfs: remove sparse errors in proc_devtree.c
CHECK   fs/proc/proc_devtree.c
fs/proc/proc_devtree.c:197:14: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:203:34: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:210:14: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:223:26: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:226:14: warning: Using plain integer as NULL pointer

Signed-off-by: Michal Simek <monstr@monstr.eu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:41 -07:00
Keika Kobayashi
d3d64df21d proc: export statistics for softirq to /proc
Export statistics for softirq in /proc/softirqs and /proc/stat.

1. /proc/softirqs
Implement /proc/softirqs which shows the number of softirq
for each CPU like /proc/interrupts.

2. /proc/stat
Add the "softirq" line to /proc/stat.
This line shows the number of softirq for all cpu.
The first column is the total of all softirqs and
each subsequent column is the total for particular softirq.

[kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop]
Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Reviewed-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:41 -07:00
David Rientjes
2ff05b2b4e oom: move oom_adj value from task_struct to mm_struct
The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares the
mm.  If a task were to be killed while attached to an mm that could not be
freed because another thread were set to OOM_DISABLE, it would have
needlessly been terminated since there is no potential for future memory
freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct.  This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.

This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already.  They simply report an
oom_adj value of OOM_DISABLE.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:43 -07:00
KOSAKI Motohiro
6837765963 mm: remove CONFIG_UNEVICTABLE_LRU config option
Currently, nobody wants to turn UNEVICTABLE_LRU off.  Thus this
configurability is unnecessary.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:42 -07:00
Wu Fengguang
1779754959 proc: export more page flags in /proc/kpageflags
Export all page flags faithfully in /proc/kpageflags.

	11. KPF_MMAP		(pseudo flag) memory mapped page
	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
	13. KPF_SWAPCACHE	page is in swap cache
	14. KPF_SWAPBACKED	page is swap/RAM backed
	15. KPF_COMPOUND_HEAD	(*)
	16. KPF_COMPOUND_TAIL	(*)
	17. KPF_HUGE		hugeTLB pages
	18. KPF_UNEVICTABLE	page is in the unevictable LRU list
	19. KPF_HWPOISON(TBD)	hardware detected corruption
	20. KPF_NOPAGE		(pseudo flag) no page frame at the address
	32-39.			more obscure flags for kernel developers

	(*) For compound pages, exporting _both_ head/tail info enables
	    users to tell where a compound page starts/ends, and its order.

The accompanying page-types tool will handle the details like decoupling
overloaded flags and hiding obscure flags to normal users.

Thanks to KOSAKI and Andi for their valuable recommendations!

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:38 -07:00
Wu Fengguang
ed7ce0f102 proc: kpagecount/kpageflags code cleanup
Move increments of pfn/out to bottom of the loop.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:36 -07:00
Wu Fengguang
20a0307c03 mm: introduce PageHuge() for testing huge/gigantic pages
A series of patches to enhance the /proc/pagemap interface and to add a
userspace executable which can be used to present the pagemap data.

Export 10 more flags to end users (and more for kernel developers):

        11. KPF_MMAP            (pseudo flag) memory mapped page
        12. KPF_ANON            (pseudo flag) memory mapped page (anonymous)
        13. KPF_SWAPCACHE       page is in swap cache
        14. KPF_SWAPBACKED      page is swap/RAM backed
        15. KPF_COMPOUND_HEAD   (*)
        16. KPF_COMPOUND_TAIL   (*)
        17. KPF_HUGE		hugeTLB pages
        18. KPF_UNEVICTABLE     page is in the unevictable LRU list
        19. KPF_HWPOISON        hardware detected corruption
        20. KPF_NOPAGE          (pseudo flag) no page frame at the address

        (*) For compound pages, exporting _both_ head/tail info enables
            users to tell where a compound page starts/ends, and its order.

a simple demo of the page-types tool

# ./page-types -h
page-types [options]
            -r|--raw                  Raw mode, for kernel developers
            -a|--addr    addr-spec    Walk a range of pages
            -b|--bits    bits-spec    Walk pages with specified bits
            -l|--list                 Show page details in ranges
            -L|--list-each            Show page details one by one
            -N|--no-summary           Don't show summay info
            -h|--help                 Show this usage message
addr-spec:
            N                         one page at offset N (unit: pages)
            N+M                       pages range from N to N+M-1
            N,M                       pages range from N to M-1
            N,                        pages range from N to end
            ,M                        pages range from 0 to M
bits-spec:
            bit1,bit2                 (flags & (bit1|bit2)) != 0
            bit1,bit2=bit1            (flags & (bit1|bit2)) == bit1
            bit1,~bit2                (flags & (bit1|bit2)) == bit1
            =bit1,bit2                flags == (bit1|bit2)
bit-names:
          locked              error         referenced           uptodate
           dirty                lru             active               slab
       writeback            reclaim              buddy               mmap
       anonymous          swapcache         swapbacked      compound_head
   compound_tail               huge        unevictable           hwpoison
          nopage           reserved(r)         mlocked(r)    mappedtodisk(r)
         private(r)       private_2(r)   owner_private(r)            arch(r)
        uncached(r)       readahead(o)       slob_free(o)     slub_frozen(o)
      slub_debug(o)
                                   (r) raw mode bits  (o) overloaded bits

# ./page-types
             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000000          487369     1903  _________________________________
0x0000000000000014               5        0  __R_D____________________________  referenced,dirty
0x0000000000000020               1        0  _____l___________________________  lru
0x0000000000000024              34        0  __R__l___________________________  referenced,lru
0x0000000000000028            3838       14  ___U_l___________________________  uptodate,lru
0x0001000000000028              48        0  ___U_l_______________________I___  uptodate,lru,readahead
0x000000000000002c            6478       25  __RU_l___________________________  referenced,uptodate,lru
0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
0x0000000000000040            8344       32  ______A__________________________  active
0x0000000000000060               1        0  _____lA__________________________  lru,active
0x0000000000000068             348        1  ___U_lA__________________________  uptodate,lru,active
0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
0x000000000000006c             988        3  __RU_lA__________________________  referenced,uptodate,lru,active
0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400             503        1  __________B______________________  buddy
0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000             492        1  ____________a____________________  anonymous
0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c              30        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total          513968     2007

# ./page-types -r
             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000000          468002     1828  _________________________________
0x0000000100000000           19102       74  _____________________r___________  reserved
0x0000000000008000              41        0  _______________H_________________  compound_head
0x0000000000010000             188        0  ________________T________________  compound_tail
0x0000000000008014               1        0  __R_D__________H_________________  referenced,dirty,compound_head
0x0000000000010014               4        0  __R_D___________T________________  referenced,dirty,compound_tail
0x0000000000000020               1        0  _____l___________________________  lru
0x0000000800000024              34        0  __R__l__________________P________  referenced,lru,private
0x0000000000000028            3794       14  ___U_l___________________________  uptodate,lru
0x0001000000000028              46        0  ___U_l_______________________I___  uptodate,lru,readahead
0x0000000400000028              44        0  ___U_l_________________d_________  uptodate,lru,mappedtodisk
0x0001000400000028               2        0  ___U_l_________________d_____I___  uptodate,lru,mappedtodisk,readahead
0x000000000000002c            6434       25  __RU_l___________________________  referenced,uptodate,lru
0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
0x000000040000002c              14        0  __RU_l_________________d_________  referenced,uptodate,lru,mappedtodisk
0x000000080000002c              30        0  __RU_l__________________P________  referenced,uptodate,lru,private
0x0000000800000040            8124       31  ______A_________________P________  active,private
0x0000000000000040             219        0  ______A__________________________  active
0x0000000800000060               1        0  _____lA_________________P________  lru,active,private
0x0000000000000068             322        1  ___U_lA__________________________  uptodate,lru,active
0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
0x0000000400000068              13        0  ___U_lA________________d_________  uptodate,lru,active,mappedtodisk
0x0000000800000068              12        0  ___U_lA_________________P________  uptodate,lru,active,private
0x000000000000006c             977        3  __RU_lA__________________________  referenced,uptodate,lru,active
0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
0x000000040000006c               5        0  __RU_lA________________d_________  referenced,uptodate,lru,active,mappedtodisk
0x000000080000006c               3        0  __RU_lA_________________P________  referenced,uptodate,lru,active,private
0x0000000c0000006c               3        0  __RU_lA________________dP________  referenced,uptodate,lru,active,mappedtodisk,private
0x0000000c00000068               1        0  ___U_lA________________dP________  uptodate,lru,active,mappedtodisk,private
0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400             538        2  __________B______________________  buddy
0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000             492        1  ____________a____________________  anonymous
0x0000000000005008               2        0  ___U________a_b__________________  uptodate,anonymous,swapbacked
0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
0x000000000000580c               1        0  __RU_______Ma_b__________________  referenced,uptodate,mmap,anonymous,swapbacked
0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c              29        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total          513968     2007

# ./page-types --raw --list --no-summary --bits reserved
offset  count   flags
0       15      _____________________r___________
31      4       _____________________r___________
159     97      _____________________r___________
4096    2067    _____________________r___________
6752    2390    _____________________r___________
9355    3       _____________________r___________
9728    14526   _____________________r___________

This patch:

Introduce PageHuge(), which identifies huge/gigantic pages by their
dedicated compound destructor functions.

Also move prep_compound_gigantic_page() to hugetlb.c and make
__free_pages_ok() non-static.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:36 -07:00
Al Viro
3174c21b74 Move junk from proc_fs.h to fs/proc/internal.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
James Morris
73fbad283c Merge branch 'next' into for-linus 2009-06-11 11:03:14 +10:00
Linus Torvalds
99e97b860e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: fix typo in sched-rt-group.txt file
  ftrace: fix typo about map of kernel priority in ftrace.txt file.
  sched: properly define the sched_group::cpumask and sched_domain::span fields
  sched, timers: cleanup avenrun users
  sched, timers: move calc_load() to scheduler
  sched: Don't export sched_mc_power_savings on multi-socket single core system
  sched: emit thread info flags with stack trace
  sched: rt: document the risk of small values in the bandwidth settings
  sched: Replace first_cpu() with cpumask_first() in ILB nomination code
  sched: remove extra call overhead for schedule()
  sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus())
  wait: don't use __wake_up_common()
  sched: Nominate a power-efficient ilb in select_nohz_balancer()
  sched: Nominate idle load balancer from a semi-idle package.
  sched: remove redundant hierarchy walk in check_preempt_wakeup
2009-06-10 15:32:59 -07:00
James Morris
0b4ec6e4e0 Merge branch 'master' into next 2009-06-09 09:27:53 +10:00
KOSAKI Motohiro
bd6daba909 procfs: make errno values consistent when open pident vs exit(2) race occurs
proc_pident_instantiate() has following call flow.

proc_pident_lookup()
  proc_pident_instantiate()
    proc_pid_make_inode()

And, proc_pident_lookup() has following error handling.

	const struct pid_entry *p, *last;
	error = ERR_PTR(-ENOENT);
	if (!task)
		goto out_no_task;

Then, proc_pident_instantiate should return ENOENT too when racing against
exit(2) occur.

EINAL has two bad reason.
  - it implies caller is wrong. bad the race isn't caller's mistake.
  - man 2 open don't explain EINVAL. user often don't handle it.

Note: Other proc_pid_make_inode() caller already use ENOENT properly.

Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29 08:40:02 -07:00
James Morris
2c9e703c61 Merge branch 'master' into next
Conflicts:
	fs/exec.c

Removed IMA changes (the IMA checks are now performed via may_open()).

Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 18:40:59 +10:00
Thomas Gleixner
2d02494f5a sched, timers: cleanup avenrun users
avenrun is an rough estimate so we don't have to worry about
consistency of the three avenrun values. Remove the xtime lock
dependency and provide a function to scale the values. Cleanup the
users.

[ Impact: cleanup ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2009-05-15 15:32:45 +02:00
David Howells
107db7c7dd CRED: Guard the setprocattr security hook against ptrace
Guard the setprocattr security hook against ptrace by taking the target task's
cred_guard_mutex around it.  The problem is that setprocattr() may otherwise
note the lack of a debugger, and then perform an action on that basis whilst
letting a debugger attach between the two points.  Holding cred_guard_mutex
across the test and the action prevents ptrace_attach() from doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-11 08:15:39 +10:00
Al Viro
6f5bbff9a1 Convert obvious places to deactivate_locked_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Jake Edge
f83ce3e6b0 proc: avoid information leaks to non-privileged processes
By using the same test as is used for /proc/pid/maps and /proc/pid/smaps,
only allow processes that can ptrace() a given process to see information
that might be used to bypass address space layout randomization (ASLR).
These include eip, esp, wchan, and start_stack in /proc/pid/stat as well
as the non-symbolic output from /proc/pid/wchan.

ASLR can be bypassed by sampling eip as shown by the proof-of-concept
code at http://code.google.com/p/fuzzyaslr/ As part of a presentation
(http://www.cr0.org/paper/to-jt-linux-alsr-leak.pdf) esp and wchan were
also noted as possibly usable information leaks as well.  The
start_stack address also leaks potentially useful information.

Cc: Stable Team <stable@kernel.org>
Signed-off-by: Jake Edge <jake@lwn.net>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-04 15:14:23 -07:00
KOSAKI Motohiro
00a62ce91e mm: fix Committed_AS underflow on large NR_CPUS environment
The Committed_AS field can underflow in certain situations:

>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB

Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
not check for underflow.

But NR_CPUS proportional isn't good calculation.  In general,
possibility of lock contention is proportional to the number of online
cpus, not theorical maximum cpus (NR_CPUS).

The current kernel has generic percpu-counter stuff.  using it is right
way.  it makes code simplify and percpu_counter_read_positive() don't
make underflow issue.

Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>		[All kernel versions]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:10 -07:00
Vitaly Mayatskikh
0816178638 pagemap: require aligned-length, non-null reads of /proc/pid/pagemap
The intention of commit aae8679b0e
("pagemap: fix bug in add_to_pagemap, require aligned-length reads of
/proc/pid/pagemap") was to force reads of /proc/pid/pagemap to be a
multiple of 8 bytes, but now it allows to read 0 bytes, which actually
puts some data to user's buffer.  According to POSIX, if count is zero,
read() should return zero and has no other results.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Cc: Thomas Tuttle <ttuttle@google.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
Martin Schwidefsky
e1c805309d [S390] /proc/stat idle field for idle cpus
The cpu idle field in the output of /proc/stat is too small for cpus
that have been idle for more than a tick. Add the architecture hook
arch_idle_time that allows to add the not accounted idle time of a
sleeping cpu without waking the cpu.

The s390 implementation of arch_idle_time uses the already existing
s390_idle_data per_cpu variable to find the sleep time of a neighboring
idle cpu.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-04-23 13:58:17 +02:00
KOSAKI Motohiro
31b07093c4 proc: mounts_poll() make consistent to mdstat_poll
In recently sysfs_poll discussion, Neil Brown pointed out /proc/mounts
also should be fixed.

SUSv3 says "Regular files shall always poll TRUE for reading and
writing".  see
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html

Then, mounts_poll()'s default should be "POLLIN | POLLRDNORM".  it mean
always readable.

In addition, event trigger should use "POLLERR | POLLPRI" instead
POLLERR.  it makes consistent to mdstat_poll() and sysfs_poll(). and,
select(2) can handle POLLPRI easily.


Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-16 16:17:10 -07:00
Nobuhiro Iwamatsu
4c967291fc nommu: fix typo vma->pg_off to vma->vm_pgoff
6260a4b052 ("/proc/pid/maps: don't show
pgoff of pure ANON VMAs" had a typo.

fs/proc/task_nommu.c:138: error: 'struct vm_area_struct' has no member named 'pg_off'
distcc[21484] ERROR: compile fs/proc/task_nommu.c on sprygo/32 failed

Signed-off-by: Nobuhiro Iwamatsu <iwamatsu.nobuhiro@renesas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-08 10:21:44 -07:00
KAMEZAWA Hiroyuki
6260a4b052 /proc/pid/maps: don't show pgoff of pure ANON VMAs
Recently, it's argued that what proc/pid/maps shows is ugly when a 32bit
binary runs on 64bit host.

/proc/pid/maps outputs vma's pgoff member but vma->pgoff is of no use
information is the vma is for ANON.  With this patch, /proc/pid/maps shows
just 0 if no file backing store.

[akpm@linux-foundation.org: coding-style fixes]
[kamezawa.hiroyu@jp.fujitsu.com: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mike Waychison <mikew@google.com>
Reported-by: Ying Han <yinghan@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:03 -07:00
Linus Torvalds
811158b147 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  trivial: Update my email address
  trivial: NULL noise: drivers/mtd/tests/mtd_*test.c
  trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h
  trivial: Fix misspelling of "Celsius".
  trivial: remove unused variable 'path' in alloc_file()
  trivial: fix a pdlfush -> pdflush typo in comment
  trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL
  trivial: wusb: Storage class should be before const qualifier
  trivial: drivers/char/bsr.c: Storage class should be before const qualifier
  trivial: h8300: Storage class should be before const qualifier
  trivial: fix where cgroup documentation is not correctly referred to
  trivial: Give the right path in Documentation example
  trivial: MTD: remove EOL from MODULE_DESCRIPTION
  trivial: Fix typo in bio_split()'s documentation
  trivial: PWM: fix of #endif comment
  trivial: fix typos/grammar errors in Kconfig texts
  trivial: Fix misspelling of firmware
  trivial: cgroups: documentation typo and spelling corrections
  trivial: Update contact info for Jochen Hein
  trivial: fix typo "resgister" -> "register"
  ...
2009-04-03 15:24:35 -07:00
Linus Torvalds
8fe74cf053 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  Remove two unneeded exports and make two symbols static in fs/mpage.c
  Cleanup after commit 585d3bc06f
  Trim includes of fdtable.h
  Don't crap into descriptor table in binfmt_som
  Trim includes in binfmt_elf
  Don't mess with descriptor table in load_elf_binary()
  Get rid of indirect include of fs_struct.h
  New helper - current_umask()
  check_unsafe_exec() doesn't care about signal handlers sharing
  New locking/refcounting for fs_struct
  Take fs_struct handling to new file (fs/fs_struct.c)
  Get rid of bumping fs_struct refcount in pivot_root(2)
  Kill unsharing fs_struct in __set_personality()
2009-04-02 21:09:10 -07:00
David Howells
33e5d76979 nommu: fix a number of issues with the per-MM VMA patch
Fix a number of issues with the per-MM VMA patch:

 (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
     a NOMMU system with more than 2G pages.  Makes no difference on a 32-bit
     system.

 (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
     lest it overflow.

 (3) Move the allocation of the vm_area_struct slab back for fork.c.

 (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

 (5) Use BUG_ON() rather than if () BUG().

 (6) Make the default validate_nommu_regions() a static inline rather than a
     #define.

 (7) Make free_page_series()'s objection to pages with a refcount != 1 more
     informative.

 (8) Adjust the __put_nommu_region() banner comment to indicate that the
     semaphore must be held for writing.

 (9) Limit the number of warnings about munmaps of non-mmapped regions.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:48 -07:00
Alexey Dobriyan
0f043a81eb proc tty: remove struct tty_operations::read_proc
struct tty_operations::proc_fops took it's place and there is one less
create_proc_read_entry() user now!

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:10 -07:00
Alexey Dobriyan
ae149b6bec proc tty: add struct tty_operations::proc_fops
Used for gradual switch of TTY drivers from using ->read_proc which helps
with gradual switch from ->read_proc for the whole tree.

As side effect, fix possible race condition when ->data initialized after
PDE is hooked into proc tree.

->proc_fops takes precedence over ->read_proc.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:08 -07:00
Al Viro
5ad4e53bd5 Get rid of indirect include of fs_struct.h
Don't pull it in sched.h; very few files actually need it and those
can include directly.  sched.h itself only needs forward declaration
of struct fs_struct;

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:27 -04:00
Al Viro
498052bba5 New locking/refcounting for fs_struct
* all changes of current->fs are done under task_lock and write_lock of
  old fs->lock
* refcount is not atomic anymore (same protection)
* its decrements are done when removing reference from current; at the
  same time we decide whether to free it.
* put_fs_struct() is gone
* new field - ->in_exec.  Set by check_unsafe_exec() if we are trying to do
  execve() and only subthreads share fs_struct.  Cleared when finishing exec
  (success and failure alike).  Makes CLONE_FS fail with -EAGAIN if set.
* check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
  is in progress.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:26 -04:00
Alexey Dobriyan
a9caa3de24 Revert "proc: revert /proc/uptime to ->read_proc hook"
This reverts commit 6c87df37dc.

proc files implemented through seq_file do pread(2) now.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-03-31 01:14:58 +04:00
Alexey Dobriyan
99b7623380 proc 2/2: remove struct proc_dir_entry::owner
Setting ->owner as done currently (pde->owner = THIS_MODULE) is racy
as correctly noted at bug #12454. Someone can lookup entry with NULL
->owner, thus not pinning enything, and release it later resulting
in module refcount underflow.

We can keep ->owner and supply it at registration time like ->proc_fops
and ->data.

But this leaves ->owner as easy-manipulative field (just one C assignment)
and somebody will forget to unpin previous/pin current module when
switching ->owner. ->proc_fops is declared as "const" which should give
some thoughts.

->read_proc/->write_proc were just fixed to not require ->owner for
protection.

rmmod'ed directories will be empty and return "." and ".." -- no harm.
And directories with tricky enough readdir and lookup shouldn't be modular.
We definitely don't want such modular code.

Removing ->owner will also make PDE smaller.

So, let's nuke it.

Kudos to Jeff Layton for reminding about this, let's say, oversight.

http://bugzilla.kernel.org/show_bug.cgi?id=12454

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-03-31 01:14:44 +04:00
Alexey Dobriyan
3dec7f59c3 proc 1/2: do PDE usecounting even for ->read_proc, ->write_proc
struct proc_dir_entry::owner is going to be removed. Now it's only necessary
to protect PDEs which are using ->read_proc, ->write_proc hooks.

However, ->owner assignments are racy and make it very easy for someone to switch
->owner on live PDE (as some subsystems do) without fixing refcounts and so on.

http://bugzilla.kernel.org/show_bug.cgi?id=12454

So, ->owner is on death row.

Proxy file operations exist already (proc_file_operations), just bump usecount
when necessary.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-03-31 01:14:27 +04:00
Milind Arun Choudhary
09729a9919 proc: fix sparse warnings in pagemap_read()
fs/proc/task_mmu.c:696:12: warning: cast removes address space of expression
fs/proc/task_mmu.c:696:9: warning: incorrect type in assignment (different address spaces)
fs/proc/task_mmu.c:696:9:    expected unsigned long long [noderef] [usertype] <asn:1>*out
fs/proc/task_mmu.c:696:9:    got unsigned long long [usertype] *<noident>
fs/proc/task_mmu.c:697:12: warning: cast removes address space of expression
fs/proc/task_mmu.c:697:9: warning: incorrect type in assignment (different address spaces)
fs/proc/task_mmu.c:697:9:    expected unsigned long long [noderef] [usertype] <asn:1>*end
fs/proc/task_mmu.c:697:9:    got unsigned long long [usertype] *<noident>
fs/proc/task_mmu.c:723:12: warning: cast removes address space of expression
fs/proc/task_mmu.c:723:26: error: subtraction of different types can't work (different address spaces)
fs/proc/task_mmu.c:725:24: error: subtraction of different types can't work (different address spaces)

Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-03-31 01:14:22 +04:00
Randy Dunlap
1681bc30f2 proc: move fs/proc/inode-alloc.txt comment into a source file
so that people will realize that it exists and can update it as needed.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-03-31 01:13:12 +04:00
Uwe Kleine-Koenig
973c32bebf trivial: fix typo "kernal" -> "kernel"
Signed-off-by: Uwe Kleine-Koenig <Uwe.Kleine-Koenig@digi.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-03-30 15:21:57 +02:00
Hugh Dickins
7c2c7d9930 fix setuid sometimes wouldn't
check_unsafe_exec() also notes whether the fs_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient
use of get_fs_struct(), which also raises that sharing count.

This might occasionally cause a setuid program not to change euid,
in the same way as happened with files->count (check_unsafe_exec
also looks at sighand->count, but /proc doesn't raise that one).

We'd prefer exec not to unshare fs_struct: so fix this in procfs,
replacing get_fs_struct() by get_fs_path(), which does path_get
while still holding task_lock, instead of raising fs->count.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
___

 fs/proc/base.c |   50 +++++++++++++++--------------------------------
 1 file changed, 16 insertions(+), 34 deletions(-)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-28 17:30:00 -07:00
Sukadev Bhattiprolu
a3ec947c85 vfs: simple_set_mnt() should return void
simple_set_mnt() is defined as returning 'int' but always returns 0.
Callers assume simple_set_mnt() never fails and don't properly cleanup if
it were to _ever_ fail.  For instance, get_sb_single() and get_sb_nodev()
should:

        up_write(sb->s_unmount);
        deactivate_super(sb);

if simple_set_mnt() fails.

Since simple_set_mnt() never fails, would be cleaner if it did not
return anything.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-27 14:44:03 -04:00
Al Viro
d72f71eb0e constify dentry_operations: procfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-27 14:44:01 -04:00
Linus Torvalds
ee568b25ee Avoid 64-bit "switch()" statements on 32-bit architectures
Commit ee6f779b9e ("filp->f_pos not
correctly updated in proc_task_readdir") changed the proc code to use
filp->f_pos directly, rather than through a temporary variable.  In the
process, that caused the operations to be done on the full 64 bits, even
though the offset is never that big.

That's all fine and dandy per se, but for some unfathomable reason gcc
generates absolutely horrid code when using 64-bit values in switch()
statements.  To the point of actually calling out to gcc helper
functions like __cmpdi2 rather than just doing the trivial comparisons
directly the way gcc does for normal compares.  At which point we get
link failures, because we really don't want to support that kind of
crazy code.

Fix this by just casting the f_pos value to "unsigned long", which
is plenty big enough for /proc, and avoids the gcc code generation issue.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Zhang Le <r0bertz@gentoo.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-17 10:02:35 -07:00
Zhang Le
ee6f779b9e filp->f_pos not correctly updated in proc_task_readdir
filp->f_pos only get updated at the end of the function. Thus d_off of those
dirents who are in the middle will be 0, and this will cause a problem in
glibc's readdir implementation, specifically endless loop. Because when overflow
occurs, f_pos will be set to next dirent to read, however it will be 0, unless
the next one is the last one. So it will start over again and again.

There is a sample program in man 2 gendents. This is the output of the program
running on a multithread program's task dir before this patch is applied:

  $ ./a.out /proc/3807/task
  --------------- nread=128 ---------------
  i-node#  file type  d_reclen  d_off   d_name
    506442  directory    16          1  .
    506441  directory    16          0  ..
    506443  directory    16          0  3807
    506444  directory    16          0  3809
    506445  directory    16          0  3812
    506446  directory    16          0  3861
    506447  directory    16          0  3862
    506448  directory    16          8  3863

This is the output after this patch is applied

  $ ./a.out /proc/3807/task
  --------------- nread=128 ---------------
  i-node#  file type  d_reclen  d_off   d_name
    506442  directory    16          1  .
    506441  directory    16          2  ..
    506443  directory    16          3  3807
    506444  directory    16          4  3809
    506445  directory    16          5  3812
    506446  directory    16          6  3861
    506447  directory    16          7  3862
    506448  directory    16          8  3863

Signed-off-by: Zhang Le <r0bertz@gentoo.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-16 07:51:33 -07:00
Wu Fengguang
ad3bdefe87 proc: fix kflags to uflags copying in /proc/kpageflags
Fix kpf_copy_bit(src,dst) to be kpf_copy_bit(dst,src) to match the
actual call patterns, e.g. kpf_copy_bit(kflags, KPF_LOCKED, PG_locked).

This misplacement of src/dst only affected reporting of PG_writeback,
PG_reclaim and PG_buddy. For others kflags==uflags so not affected.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-11 07:43:33 -07:00
Helge Bahmann
e07a4b9217 proc: fix PG_locked reporting in /proc/kpageflags
Expr always evaluates to zero.

Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-02-24 21:17:58 +03:00
Krzysztof Sachanowicz
cac711211a proc: proc_get_inode should de_put when inode already initialized
de_get is called before every proc_get_inode, but corresponding de_put is
called only when dropping last reference to an inode. This might cause
something like
remove_proc_entry: /proc/stats busy, count=14496
to be printed to the syslog.

The fix is to call de_put in case of an already initialized inode in
proc_get_inode.

Signed-off-by: Krzysztof Sachanowicz <analyzer1@gmail.com>
Tested-by: Marcin Pilipczuk <marcin.pilipczuk@gmail.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-23 18:25:32 -08:00
Linus Torvalds
c40f6f8bbc Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu:
  NOMMU: Support XIP on initramfs
  NOMMU: Teach kobjsize() about VMA regions.
  FLAT: Don't attempt to expand the userspace stack to fill the space allocated
  FDPIC: Don't attempt to expand the userspace stack to fill the space allocated
  NOMMU: Improve procfs output using per-MM VMAs
  NOMMU: Make mmap allocation page trimming behaviour configurable.
  NOMMU: Make VMAs per MM as for MMU-mode linux
  NOMMU: Delete askedalloc and realalloc variables
  NOMMU: Rename ARM's struct vm_region
  NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area()
2009-01-09 14:00:58 -08:00
Magnus Damm
921d58c0e6 vmcore: remove saved_max_pfn check
Remove the saved_max_pfn check from the /proc/vmcore function
read_from_oldmem().  No need to verify, we should be able to just trust
that "elfcorehdr=" is correctly passed to the crash kernel on the kernel
command line like we do with other parameters.

The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to
read_from_oldmem() in drivers/char/mem.c, but only in the latter it makes
sense to use saved_max_pfn.  For oldmem it is used to determine when to
stop reading.  For vmcore we already have the elf header info pointing out
the physical memory regions, no need to pass the end-of- old-memory twice.

Removing the saved_max_pfn check from vmcore makes it possible for
architectures to skip oldmem but still support crash dump through vmcore -
without the need for the old saved_max_pfn cruft.

Architectures that want to play safe can do the saved_max_pfn check in
copy_oldmem_page().  Not sure why anyone would want to do that, but that's
even safer than today - the saved_max_pfn check in vmcore removed by this
patch only checks the first page.

Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Simon Horman <horms@verge.net.au>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:14 -08:00
David Howells
38f714795b NOMMU: Improve procfs output using per-MM VMAs
Improve procfs output using per-MM VMAs for process memory accounting.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
2009-01-08 12:04:47 +00:00
David Howells
8feae13110 NOMMU: Make VMAs per MM as for MMU-mode linux
Make VMAs per mm_struct as for MMU-mode linux.  This solves two problems:

 (1) In SYSV SHM where nattch for a segment does not reflect the number of
     shmat's (and forks) done.

 (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
     exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
     that a VMA might be shared and already have its vm_mm assigned to another
     process or a dead process.

A new struct (vm_region) is introduced to track a mapped region and to remember
the circumstances under which it may be shared and the vm_list_struct structure
is discarded as it's no longer required.

This patch makes the following additional changes:

 (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
     with no recourse to __GFP_COMP, so the pages are not composite.  Instead,
     each page has a reference on it held by the region.  Anything else that is
     interested in such a page will have to get a reference on it to retain it.
     When the pages are released due to unmapping, each page is passed to
     put_page() and will be freed when the page usage count reaches zero.

 (2) Excess pages are trimmed after an allocation as the allocation must be
     made as a power-of-2 quantity of pages.

 (3) VMAs are added to the parent MM's R/B tree and mmap lists.  As an MM may
     end up with overlapping VMAs within the tree, the VMA struct address is
     appended to the sort key.

 (4) Non-anonymous VMAs are now added to the backing inode's prio list.

 (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
     the backing region.  The VMA and region structs will be split if
     necessary.

 (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
     segment instead of all the attachments at that addresss.  Multiple
     shmat()'s return the same address under NOMMU-mode instead of different
     virtual addresses as under MMU-mode.

 (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.

 (8) /proc/maps is now the global list of mapped regions, and may list bits
     that aren't actually mapped anywhere.

 (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
     of RAM currently allocated by mmap to hold mappable regions that can't be
     mapped directly.  These are copies of the backing device or file if not
     anonymous.

These changes make NOMMU mode more similar to MMU mode.  The downside is that
NOMMU mode requires some extra memory to track things over NOMMU without this
patch (VMAs are no longer shared, and there are now region structs).

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
2009-01-08 12:04:47 +00:00
Linus Torvalds
a0c9f240a9 Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc
* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc:
  proc: remove write-only variable in proc_pident_lookup()
  proc: fix sparse warning
  proc: add /proc/*/stack
  proc: remove '##' usage
  proc: remove useless WARN_ONs
  proc: stop using BKL
2009-01-07 12:01:06 -08:00
Linus Torvalds
57c44c5f6f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
  trivial: chack -> check typo fix in main Makefile
  trivial: Add a space (and a comma) to a printk in 8250 driver
  trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx
  trivial: Fix misspelling of "firmware" in powerpc Makefile
  trivial: Fix misspelling of "firmware" in usb.c
  trivial: Fix misspelling of "firmware" in qla1280.c
  trivial: Fix misspelling of "firmware" in a100u2w.c
  trivial: Fix misspelling of "firmware" in megaraid.c
  trivial: Fix misspelling of "firmware" in ql4_mbx.c
  trivial: Fix misspelling of "firmware" in acpi_memhotplug.c
  trivial: Fix misspelling of "firmware" in ipw2100.c
  trivial: Fix misspelling of "firmware" in atmel.c
  trivial: Fix misspelled firmware in Kconfig
  trivial: fix an -> a typos in documentation and comments
  trivial: fix then -> than typos in comments and documentation
  trivial: update Jesper Juhl CREDITS entry with new email
  trivial: fix singal -> signal typo
  trivial: Fix incorrect use of "loose" in event.c
  trivial: printk: fix indentation of new_text_line declaration
  trivial: rtc-stk17ta8: fix sparse warning
  ...
2009-01-07 11:31:52 -08:00
Mel Gorman
3340289ddf mm: report the MMU pagesize in /proc/pid/smaps
The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
kernel to back a VMA.  This matches the size used by the MMU in the
majority of cases.  However, one counter-example occurs on PPC64 kernels
whereby a kernel using 64K as a base pagesize may still use 4K pages for
the MMU on older processor.  To distinguish, this patch reports
MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Mel Gorman
08fba69986 mm: report the pagesize backing a VMA in /proc/pid/smaps
It is useful to verify a hugepage-aware application is using the expected
pagesizes for its memory regions. This patch creates an entry called
KernelPageSize in /proc/pid/smaps that is the size of page used by the
kernel to back a VMA. The entry is not called PageSize as it is possible
the MMU uses a different size. This extension should not break any sensible
parser that skips lines containing unrecognised information.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Frederik Schwarzer
025dfdafe7 trivial: fix then -> than typos in comments and documentation
- (better, more, bigger ...) then -> (...) than

Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-01-06 11:28:06 +01:00
Al Viro
56ff5efad9 zero i_uid/i_gid on inode allocation
... and don't bother in callers.  Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.

i_mode is not worth the effort; it has no common default value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-01-05 11:54:28 -05:00
WANG Cong
230e40fbda proc: remove write-only variable in proc_pident_lookup()
Signed-off-by: WANG Cong <wangcong@zeuux.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-05 12:27:45 +03:00
Hannes Eder
dfe6b7d940 proc: fix sparse warning
fs/proc/base.c:312:4: warning: do-while statement is not a compound statement

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-05 12:27:45 +03:00
Ken Chen
2ec220e27f proc: add /proc/*/stack
/proc/*/stack adds the ability to query a task's stack trace. It is more
useful than /proc/*/wchan as it provides full stack trace instead of single
depth. Example output:

	$ cat /proc/self/stack
	[<c010a271>] save_stack_trace_tsk+0x17/0x35
	[<c01827b4>] proc_pid_stack+0x4a/0x76
	[<c018312d>] proc_single_show+0x4a/0x5e
	[<c016bdec>] seq_read+0xf3/0x29f
	[<c015a004>] vfs_read+0x6d/0x91
	[<c015a0c1>] sys_read+0x3b/0x60
	[<c0102eda>] syscall_call+0x7/0xb
	[<ffffffff>] 0xffffffff

[add save_stack_trace_tsk() on mips, ACK Ralf --adobriyan]
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-05 12:27:44 +03:00
Alexey Dobriyan
631f9c1868 proc: remove '##' usage
Inability to jump to /proc/*/foo handlers with ctags is annoying.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-05 12:27:44 +03:00
Alexey Dobriyan
ecae934edc proc: remove useless WARN_ONs
NULL "struct inode *" means VFS passed NULL inode to ->open.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-05 12:27:44 +03:00
Alexey Dobriyan
b4df2b92d8 proc: stop using BKL
There are four BKL users in proc: de_put(), proc_lookup_de(),
proc_readdir_de(), proc_root_readdir(),

1) de_put()
-----------
de_put() is classic atomic_dec_and_test() refcount wrapper -- no BKL
needed. BKL doesn't matter to possible refcount leak as well.

2) proc_lookup_de()
-------------------
Walking PDE list is protected by proc_subdir_lock(), proc_get_inode() is
potentially blocking, all callers of proc_lookup_de() eventually end up
from ->lookup hooks which is protected by directory's ->i_mutex -- BKL
doesn't protect anything.

3) proc_readdir_de()
--------------------
"." and ".." part doesn't need BKL, walking PDE list is under
proc_subdir_lock, calling filldir callback is potentially blocking
because it writes to luserspace. All proc_readdir_de() callers
eventually come from ->readdir hook which is under directory's
->i_mutex -- BKL doesn't protect anything.

4) proc_root_readdir_de()
-------------------------
proc_root_readdir_de is ->readdir hook, see (3).

Since readdir hooks doesn't use BKL anymore, switch to
generic_file_llseek, since it also takes directory's i_mutex.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-05 12:27:44 +03:00
Linus Torvalds
db200df0b3 Merge branch 'irq-fixes-for-linus-4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-fixes-for-linus-4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sparseirq: move __weak symbols into separate compilation unit
  sparseirq: work around __weak alias bug
  sparseirq: fix hang with !SPARSE_IRQ
  sparseirq: set lock_class for legacy irq when sparse_irq is selected
  sparseirq: work around compiler optimizing away __weak functions
  sparseirq: fix desc->lock init
  sparseirq: do not printk when migrating IRQ descriptors
  sparseirq: remove duplicated arch_early_irq_init()
  irq: simplify for_each_irq_desc() usage
  proc: remove ifdef CONFIG_SPARSE_IRQ from stat.c
  irq: for_each_irq_desc() move to irqnr.h
  hrtimer: remove #include <linux/irq.h>
2008-12-31 09:00:59 -08:00
Linus Torvalds
179475a3b4 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, sparseirq: clean up Kconfig entry
  x86: turn CONFIG_SPARSE_IRQ off by default
  sparseirq: fix numa_migrate_irq_desc dependency and comments
  sparseirq: add kernel-doc notation for new member in irq_desc, -v2
  locking, irq: enclose irq_desc_lock_class in CONFIG_LOCKDEP
  sparseirq, xen: make sure irq_desc is allocated for interrupts
  sparseirq: fix !SMP building, #2
  x86, sparseirq: move irq_desc according to smp_affinity, v7
  proc: enclose desc variable of show_stat() in CONFIG_SPARSE_IRQ
  sparse irqs: add irqnr.h to the user headers list
  sparse irqs: handle !GENIRQ platforms
  sparseirq: fix !SMP && !PCI_MSI && !HT_IRQ build
  sparseirq: fix Alpha build failure
  sparseirq: fix typo in !CONFIG_IO_APIC case
  x86, MSI: pass irq_cfg and irq_desc
  x86: MSI start irq numbering from nr_irqs_gsi
  x86: use NR_IRQS_LEGACY
  sparse irq_desc[] array: core kernel and x86 changes
  genirq: record IRQ_LEVEL in irq_desc[]
  irq.h: remove padding from irq_desc on 64bits
2008-12-30 16:20:19 -08:00
Linus Torvalds
3c92ec8ae9 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (144 commits)
  powerpc/44x: Support 16K/64K base page sizes on 44x
  powerpc: Force memory size to be a multiple of PAGE_SIZE
  powerpc/32: Wire up the trampoline code for kdump
  powerpc/32: Add the ability for a classic ppc kernel to be loaded at 32M
  powerpc/32: Allow __ioremap on RAM addresses for kdump kernel
  powerpc/32: Setup OF properties for kdump
  powerpc/32/kdump: Implement crash_setup_regs() using ppc_save_regs()
  powerpc: Prepare xmon_save_regs for use with kdump
  powerpc: Remove default kexec/crash_kernel ops assignments
  powerpc: Make default kexec/crash_kernel ops implicit
  powerpc: Setup OF properties for ppc32 kexec
  powerpc/pseries: Fix cpu hotplug
  powerpc: Fix KVM build on ppc440
  powerpc/cell: add QPACE as a separate Cell platform
  powerpc/cell: fix build breakage with CONFIG_SPUFS disabled
  powerpc/mpc5200: fix error paths in PSC UART probe function
  powerpc/mpc5200: add rts/cts handling in PSC UART driver
  powerpc/mpc5200: Make PSC UART driver update serial errors counters
  powerpc/mpc5200: Remove obsolete code from mpc5200 MDIO driver
  powerpc/mpc5200: Add MDMA/UDMA support to MPC5200 ATA driver
  ...

Fix trivial conflict in drivers/char/Makefile as per Paul's directions
2008-12-28 16:54:33 -08:00
Linus Torvalds
a39b863342 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
  sched: fix warning in fs/proc/base.c
  schedstat: consolidate per-task cpu runtime stats
  sched: use RCU variant of list traversal in for_each_leaf_rt_rq()
  sched, cpuacct: export percpu cpuacct cgroup stats
  sched, cpuacct: refactoring cpuusage_read / cpuusage_write
  sched: optimize update_curr()
  sched: fix wakeup preemption clock
  sched: add missing arch_update_cpu_topology() call
  sched: let arch_update_cpu_topology indicate if topology changed
  sched: idle_balance() does not call load_balance_newidle()
  sched: fix sd_parent_degenerate on non-numa smp machine
  sched: add uid information to sched_debug for CONFIG_USER_SCHED
  sched: move double_unlock_balance() higher
  sched: update comment for move_task_off_dead_cpu
  sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
  sched/rt: removed unneeded defintion
  sched: add hierarchical accounting to cpu accounting controller
  sched: include group statistics in /proc/sched_debug
  sched: rename SCHED_NO_NO_OMIT_FRAME_POINTER => SCHED_OMIT_FRAME_POINTER
  sched: clean up SCHED_CPUMASK_ALLOC
  ...
2008-12-28 12:27:58 -08:00
KOSAKI Motohiro
26ddd8d5ca proc: remove ifdef CONFIG_SPARSE_IRQ from stat.c
Impact: cleanup

irq_desc can be NULL when CONFIG_SPARSE_IRQ=y only.
therefore, NULL checking can move into kstat_irqs_cpu() of SPARSE_IRQ version.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: "Yinghai Lu" <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-26 09:48:18 +01:00
Ingo Molnar
860cf8894b Merge branches 'irq/sparseirq', 'irq/genirq' and 'irq/urgent'; commit 'v2.6.28' into irq/core 2008-12-25 16:27:54 +01:00
James Morris
cbacc2c7f0 Merge branch 'next' into for-linus 2008-12-25 11:40:09 +11:00
Ingo Molnar
826e08b015 sched: fix warning in fs/proc/base.c
Stephen Rothwell reported this new (harmless) build warning on platforms that
define u64 to long:

 fs/proc/base.c: In function 'proc_pid_schedstat':
 fs/proc/base.c:352: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'u64'

asm-generic/int-l64.h platforms strike again: that file should be eliminated.

Fix it by casting the parameters to long long.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-22 07:41:06 +01:00
Ken Chen
9c2c48020e schedstat: consolidate per-task cpu runtime stats
Impact: simplify code

When we turn on CONFIG_SCHEDSTATS, per-task cpu runtime is accumulated
twice. Once in task->se.sum_exec_runtime and once in sched_info.cpu_time.
These two stats are exactly the same.

Given that task->se.sum_exec_runtime is always accumulated by the core
scheduler, sched_info can reuse that data instead of duplicate the accounting.

Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-18 13:54:01 +01:00
KOSAKI Motohiro
13bd41bc22 proc: enclose desc variable of show_stat() in CONFIG_SPARSE_IRQ
Impact: restructure code to fix compiler warning

commit 240d367b4e moved desc usage point
into #ifdef CONFIG_SPARSE_IRQ.

Eliminate the desc variable, otherwise following warning happens:

 fs/proc/stat.c: In function 'show_stat':
 fs/proc/stat.c:31: warning: unused variable 'desc'

[ akpm: cleaned up the patch to remove #ifdef ]

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 11:24:14 +01:00
Anton Vorontsov
6b82b3e4b5 powerpc: Remove `have_of' global variable
The `have_of' variable is a relic from the arch/ppc time, it isn't
useful nowadays.

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-12-16 15:52:57 +11:00
Hugh Dickins
9c24624727 KSYM_SYMBOL_LEN fixes
Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked
to my 966c8c12dc sprint_symbol(): use
less stack exposing a bug in slub's list_locations() -
kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was
beyond the end of page provided.

The 100 slop which list_locations() allows at end of page looks roughly
enough for all the other stuff it might print after the symbol before
it checks again: break out KSYM_SYMBOL_LEN earlier than before.

Latencytop and ftrace and are using KSYM_NAME_LEN buffers where they
need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer
where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies
them.

[akpm@linux-foundation.org: ftrace.h needs module.h]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc Miles Lane <miles.lane@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10 08:01:54 -08:00
Matt Mackall
49c50342c7 pagemap: fix 32-bit pagemap regression
The large pages fix from bcf8039ed4 broke 32-bit pagemap by pulling the
pagemap entry code out into a function with the wrong return type.
Pagemap entries are 64 bits on all systems and unsigned long is only 32
bits on 32-bit systems.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Reported-by: Doug Graham <dgraham@nortel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: <stable@kernel.org>		[2.6.26.x, 2.6.27.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10 08:01:53 -08:00
Yinghai Lu
240d367b4e sparseirq: fix Alpha build failure
Impact: build fix on Alpha

-tip testing found this build failure on the Alpha defconfig:

/home/mingo/tip/fs/proc/stat.c: In function 'show_stat':
/home/mingo/tip/fs/proc/stat.c:48: error: implicit declaration of function 'for_each_irq_desc'
/home/mingo/tip/fs/proc/stat.c:48: error: expected ';' before '{' token

can not use irq_desc() in stat.c on older architectures.

Signed-off-by: Yinghai Lu <yinghai@kernel.orgg>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-09 04:16:54 +01:00
Yinghai Lu
0b8f1efad3 sparse irq_desc[] array: core kernel and x86 changes
Impact: new feature

Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
NR_CPUS set to large values. The goal is to be able to scale up to much
larger NR_IRQS value without impacting the (important) common case.

To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
irq_desc pointers.

When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
this also makes the IRQ descriptors NUMA-local (to the site that calls
request_irq()).

This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
uses desc->chip_data for x86 to store irq_cfg.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-08 14:31:51 +01:00
James Morris
f3a5c54701 Merge branch 'master' into next
Conflicts:
	fs/cifs/misc.c

Merge to resolve above, per the patch below.

Signed-off-by: James Morris <jmorris@namei.org>

diff --cc fs/cifs/misc.c
index ec36410,addd1dc..0000000
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@@ -347,13 -338,13 +338,13 @@@ header_assemble(struct smb_hdr *buffer
  		/*  BB Add support for establishing new tCon and SMB Session  */
  		/*      with userid/password pairs found on the smb session   */
  		/*	for other target tcp/ip addresses 		BB    */
 -				if (current->fsuid != treeCon->ses->linux_uid) {
 +				if (current_fsuid() != treeCon->ses->linux_uid) {
  					cFYI(1, ("Multiuser mode and UID "
  						 "did not match tcon uid"));
- 					read_lock(&GlobalSMBSeslock);
- 					list_for_each(temp_item, &GlobalSMBSessionList) {
- 						ses = list_entry(temp_item, struct cifsSesInfo, cifsSessionList);
+ 					read_lock(&cifs_tcp_ses_lock);
+ 					list_for_each(temp_item, &treeCon->ses->server->smb_ses_list) {
+ 						ses = list_entry(temp_item, struct cifsSesInfo, smb_ses_list);
 -						if (ses->linux_uid == current->fsuid) {
 +						if (ses->linux_uid == current_fsuid()) {
  							if (ses->server == treeCon->ses->server) {
  								cFYI(1, ("found matching uid substitute right smb_uid"));
  								buffer->Uid = ses->Suid;
2008-11-18 18:52:37 +11:00
Al Viro
5c06fe772d Fix broken ownership of /proc/sys/ files
D'oh...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-and-tested-by: Peter Palfrader <peter@palfrader.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-16 15:09:52 -08:00
David Howells
c69e8d9c01 CRED: Use RCU to access another task's creds and to release a task's own creds
Use RCU to access another task's creds and to release a task's own creds.
This means that it will be possible for the credentials of a task to be
replaced without another task (a) requiring a full lock to read them, and (b)
seeing deallocated memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:19 +11:00
David Howells
b6dff3ec5e CRED: Separate task security context from task_struct
Separate the task security context from task_struct.  At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:16 +11:00
Linus Torvalds
c8126cc602 Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc
* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc:
  proc: revert /proc/uptime to ->read_proc hook
2008-11-03 09:59:01 -08:00
Alexey Dobriyan
6c87df37dc proc: revert /proc/uptime to ->read_proc hook
Turned out some VMware userspace does pread(2) on /proc/uptime, but
seqfiles currently don't allow pread() resulting in -ESPIPE.

Seqfiles in theory can do pread(), but this can be a long story,
so revert to ->read_proc until then.

http://bugzilla.kernel.org/show_bug.cgi?id=11856

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-27 22:56:56 +03:00
Alan Cox
526719ba51 Switch to a valid email address...
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-27 08:40:17 -07:00
Linus Torvalds
88ed86fee6 Merge branch 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc
* 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: (35 commits)
  proc: remove fs/proc/proc_misc.c
  proc: move /proc/vmcore creation to fs/proc/vmcore.c
  proc: move pagecount stuff to fs/proc/page.c
  proc: move all /proc/kcore stuff to fs/proc/kcore.c
  proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
  proc: move /proc/modules boilerplate to kernel/module.c
  proc: move /proc/diskstats boilerplate to block/genhd.c
  proc: move /proc/zoneinfo boilerplate to mm/vmstat.c
  proc: move /proc/vmstat boilerplate to mm/vmstat.c
  proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c
  proc: move /proc/buddyinfo boilerplate to mm/vmstat.c
  proc: move /proc/vmallocinfo to mm/vmalloc.c
  proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c
  proc: move /proc/slab_allocators boilerplate to mm/slab.c
  proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c
  proc: move /proc/stat to fs/proc/stat.c
  proc: move rest of /proc/partitions code to block/genhd.c
  proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c
  proc: move /proc/devices code to fs/proc/devices.c
  proc: move rest of /proc/locks to fs/locks.c
  ...
2008-10-23 12:04:37 -07:00
Linus Torvalds
2248485640 Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
  [PATCH] kill the rest of struct file propagation in block ioctls
  [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
  [PATCH] get rid of blkdev_locked_ioctl()
  [PATCH] get rid of blkdev_driver_ioctl()
  [PATCH] sanitize blkdev_get() and friends
  [PATCH] remember mode of reiserfs journal
  [PATCH] propagate mode through swsusp_close()
  [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
  [PATCH] pass fmode_t to blkdev_put()
  [PATCH] kill the unused bsize on the send side of /dev/loop
  [PATCH] trim file propagation in block/compat_ioctl.c
  [PATCH] end of methods switch: remove the old ones
  [PATCH] switch sr
  [PATCH] switch sd
  [PATCH] switch ide-scsi
  [PATCH] switch tape_block
  [PATCH] switch dcssblk
  [PATCH] switch dasd
  [PATCH] switch mtd_blkdevs
  [PATCH] switch mmc
  ...
2008-10-23 10:23:07 -07:00
Alexey Dobriyan
59c7572e82 proc: remove fs/proc/proc_misc.c
Now that everything was moved to their more or less expected places,
apply rm(1).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:54:05 +04:00
Alexey Dobriyan
5aa140c2de proc: move /proc/vmcore creation to fs/proc/vmcore.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:51:22 +04:00
Alexey Dobriyan
6d80e53f00 proc: move pagecount stuff to fs/proc/page.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:48:31 +04:00
Alexey Dobriyan
97ce5d6dcb proc: move all /proc/kcore stuff to fs/proc/kcore.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:32:38 +04:00
Alexey Dobriyan
b5aadf7f14 proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:06:12 +04:00
Alexey Dobriyan
3b5d5c6b0c proc: move /proc/modules boilerplate to kernel/module.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:03:13 +04:00
Alexey Dobriyan
31d85ab28e proc: move /proc/diskstats boilerplate to block/genhd.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-23 17:57:37 +04:00
Alexey Dobriyan
5c9fe6281b proc: move /proc/zoneinfo boilerplate to mm/vmstat.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
2008-10-23 17:35:04 +04:00
Alexey Dobriyan
b6aa44ab69 proc: move /proc/vmstat boilerplate to mm/vmstat.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
2008-10-23 17:12:51 +04:00
Alexey Dobriyan
74e2e8e8ce proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 16:33:29 +04:00
Alexey Dobriyan
8f32f7e5ac proc: move /proc/buddyinfo boilerplate to mm/vmstat.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 16:12:04 +04:00
Alexey Dobriyan
5f6a6a9c4e proc: move /proc/vmallocinfo to mm/vmalloc.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
2008-10-23 15:48:28 +04:00
Alexey Dobriyan
7b3c3a50a3 proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c
Lose dummy ->write hook in case of SLUB, it's possible now.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-10-23 15:20:06 +04:00
Alexey Dobriyan
a0ec95a8e6 proc: move /proc/slab_allocators boilerplate to mm/slab.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-10-23 15:17:27 +04:00
Alexey Dobriyan
d6917e19f3 proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 15:15:46 +04:00
Alexey Dobriyan
df8106dbb5 proc: move /proc/stat to fs/proc/stat.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 15:14:05 +04:00
Alexey Dobriyan
f500975a3f proc: move rest of /proc/partitions code to block/genhd.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-23 15:07:31 +04:00
Alexey Dobriyan
8591cf4322 proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 15:05:11 +04:00
Alexey Dobriyan
fe2510426a proc: move /proc/devices code to fs/proc/devices.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 15:02:18 +04:00
Alexey Dobriyan
d8ba7a3633 proc: move rest of /proc/locks to fs/locks.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:37:00 +04:00
Alexey Dobriyan
ae048112c0 proc: move /proc/kmsg creation to fs/proc/kmsg.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:35:08 +04:00
Alexey Dobriyan
659689280a proc: remove remnants of ->read_proc in proc_misc.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:32:59 +04:00
Alexey Dobriyan
6e62775ece proc: move /proc/execdomains to kernel/exec_domain.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:30:41 +04:00
Alexey Dobriyan
cf9887f102 proc: switch /proc/cmdline to seq_file
and move it to fs/proc/cmdline.c while I'm at it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:29:04 +04:00
Alexey Dobriyan
6827400713 proc: move /proc/filesystems to fs/filesystems.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:27:09 +04:00
Alexey Dobriyan
4c150f6c30 proc: move /proc/stram to m68k-specific code
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:25:35 +04:00
Alexey Dobriyan
813dcf7a6e proc: move /proc/hardware to m68k-specific code
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:24:03 +04:00
Alexey Dobriyan
b457d15161 proc: switch /proc/version to seq_file
and move it to fs/proc/version.c while I'm at it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:19:58 +04:00
Alexey Dobriyan
e1759c215b proc: switch /proc/meminfo to seq_file
and move it to fs/proc/meminfo.c while I'm at it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 13:52:40 +04:00
Alexey Dobriyan
9617760287 proc: switch /proc/uptime to seq_file
and move it to fs/proc/uptime.c while I'm at it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 13:48:01 +04:00
Alexey Dobriyan
5b3acc8de8 proc: switch /proc/loadavg to seq_file
and move it to fs/proc/loadavg.c while I'm at it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 13:45:28 +04:00
Arjan van de Ven
6c2f91e077 proc: use WARN() rather than printk+backtrace
Use WARN() rather than a printk() + backtrace();
this gives a more standard format message as well as complete
information (including line numbers etc) that will be collected
by kerneloops.org

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 13:34:38 +04:00
Alexey Dobriyan
1e0edd3f67 proc: spread __init
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 13:32:31 +04:00
Alexey Dobriyan
5bcd7ff9e1 proc: proc_init_inodecache() can't fail
kmem_cache creation code will panic, don't return anything.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 13:27:50 +04:00
Joe Korty
7c88db0cb5 proc: fix vma display mismatch between /proc/pid/{maps,smaps}
Commit 4752c36978 aka
"maps4: simplify interdependence of maps and smaps" broke /proc/pid/smaps,
causing it to display some vmas twice and other vmas not at all.  For example:

    grep .- /proc/1/smaps >/tmp/smaps; diff /proc/1/maps /tmp/smaps

    1  25d24
    2  < 7fd7e23aa000-7fd7e23ac000 rw-p 7fd7e23aa000 00:00 0
    3  28a28
    4  > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

The bug has something to do with setting m->version before all the
seq_printf's have been performed.  show_map was doing this correctly,
but show_smap was doing this in the middle of its seq_printf sequence.
This patch arranges things so that the setting of m->version in show_smap
is also done at the end of its seq_printf sequence.

Testing: in addition to the above grep test, for each process I summed
up the 'Rss' fields of /proc/pid/smaps and compared that to the 'VmRSS'
field of /proc/pid/status.  All matched except for Xorg (which has a
/dev/mem mapping which Rss accounts for but VmRSS does not).  This result
gives us some confidence that neither /proc/pid/maps nor /proc/pid/smaps
are any longer skipping or double-counting vmas.

Signed-off-by: Joe Korty <joe.korty@ccur.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 13:21:29 +04:00
Miklos Szeredi
f696a3659f [PATCH] move executable checking into ->permission()
For execute permission on a regular files we need to check if file has
any execute bits at all, regardless of capabilites.

This check is normally performed by generic_permission() but was also
added to the case when the filesystem defines its own ->permission()
method.  In the latter case the filesystem should be responsible for
performing this check.

Move the check from inode_permission() inside filesystems which are
not calling generic_permission().

Create a helper function execute_ok() that returns true if the inode
is a directory or if any execute bits are present in i_mode.

Also fix up the following code:

 - coda control file is never executable
 - sysctl files are never executable
 - hfs_permission seems broken on MAY_EXEC, remove
 - hfsplus_permission is eqivalent to generic_permission(), remove

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2008-10-23 05:13:25 -04:00
Christoph Hellwig
3222a3e55f [PATCH] fix ->llseek for more directories
With this patch all directory fops instances that have a readdir
that doesn't take the BKL are switched to generic_file_llseek.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-10-23 05:13:21 -04:00
Al Viro
aeb5d72706 [PATCH] introduce fmode_t, do annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:06 -04:00
Linus Torvalds
9301975ec2 Merge branch 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu
and x86/uv.

The sparseirq branch is just preliminary groundwork: no sparse IRQs are
actually implemented by this tree anymore - just the new APIs are added
while keeping the old way intact as well (the new APIs map 1:1 to
irq_desc[]).  The 'real' sparse IRQ support will then be a relatively
small patch ontop of this - with a v2.6.29 merge target.

* 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits)
  genirq: improve include files
  intr_remapping: fix typo
  io_apic: make irq_mis_count available on 64-bit too
  genirq: fix name space collisions of nr_irqs in arch/*
  genirq: fix name space collision of nr_irqs in autoprobe.c
  genirq: use iterators for irq_desc loops
  proc: fixup irq iterator
  genirq: add reverse iterator for irq_desc
  x86: move ack_bad_irq() to irq.c
  x86: unify show_interrupts() and proc helpers
  x86: cleanup show_interrupts
  genirq: cleanup the sparseirq modifications
  genirq: remove artifacts from sparseirq removal
  genirq: revert dynarray
  genirq: remove irq_to_desc_alloc
  genirq: remove sparse irq code
  genirq: use inline function for irq_to_desc
  genirq: consolidate nr_irqs and for_each_irq_desc()
  x86: remove sparse irq from Kconfig
  genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n
  ...
2008-10-20 13:23:01 -07:00
Linus Torvalds
99ebcf8285 Merge branch 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
  fix documentation of sysrq-q really
  Fix documentation of sysrq-q
  timer_list: add base address to clock base
  timer_list: print cpu number of clockevents device
  timer_list: print real timer address
  NOHZ: restart tick device from irq_enter()
  NOHZ: split tick_nohz_restart_sched_tick()
  NOHZ: unify the nohz function calls in irq_enter()
  timers: fix itimer/many thread hang, fix
  timers: fix itimer/many thread hang, v3
  ntp: improve adjtimex frequency rounding
  timekeeping: fix rounding problem during clock update
  ntp: let update_persistent_clock() sleep
  hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds
  posix-timers: lock_timer: make it readable
  posix-timers: lock_timer: kill the bogus ->it_id check
  posix-timers: kill ->it_sigev_signo and ->it_sigev_value
  posix-timers: sys_timer_create: cleanup the error handling
  posix-timers: move the initialization of timer->sigq from send to create path
  posix-timers: sys_timer_create: simplify and s/tasklist/rcu/
  ...

Fix trivial conflicts due to sysrq-q description clahes in
Documentation/sysrq.txt and drivers/char/sysrq.c
2008-10-20 13:19:56 -07:00
Simon Horman
85a0ee342e kdump: add is_vmcore_usable() and vmcore_unusable()
The usage of elfcorehdr_addr has changed recently such that being set to
ELFCORE_ADDR_MAX is used by is_kdump_kernel() to indicate if the code is
executing in a kernel executed as a crash kernel.

However, arch/ia64/kernel/setup.c:reserve_elfcorehdr will rest
elfcorehdr_addr to ELFCORE_ADDR_MAX on error, which means any subsequent
calls to is_kdump_kernel() will return 0, even though they should return
1.

Ok, at this point in time there are no subsequent calls, but I think its
fair to say that there is ample scope for error or at the very least
confusion.

This patch add an extra state, ELFCORE_ADDR_ERR, which indicates that
elfcorehdr_addr was passed on the command line, and thus execution is
taking place in a crashdump kernel, but vmcore can't be used for some
reason.  This is tested for using is_vmcore_usable() and set using
vmcore_unusable().  A subsequent patch makes use of this new code.

To summarise, the states that elfcorehdr_addr can now be in are as follows:

ELFCORE_ADDR_MAX: not a crashdump kernel
ELFCORE_ADDR_ERR: crashdump kernel but vmcore is unusable
any other value:  crash dump kernel and vmcore is usable

Signed-off-by: Simon Horman <horms@verge.net.au>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:40 -07:00
Vivek Goyal
57cac4d188 kdump: make elfcorehdr_addr independent of CONFIG_PROC_VMCORE
o elfcorehdr_addr is used by not only the code under CONFIG_PROC_VMCORE
  but also by the code which is not inside CONFIG_PROC_VMCORE.  For
  example, is_kdump_kernel() is used by powerpc code to determine if
  kernel is booting after a panic then use previous kernel's TCE table.
  So even if CONFIG_PROC_VMCORE is not set in second kernel, one should be
  able to correctly determine that we are booting after a panic and setup
  calgary iommu accordingly.

o So remove the assumption that elfcorehdr_addr is under
  CONFIG_PROC_VMCORE.

o Move definition of elfcorehdr_addr to arch dependent crash files.
  (Unfortunately crash dump does not have an arch independent file
  otherwise that would have been the best place).

o kexec.c is not the right place as one can Have CRASH_DUMP enabled in
  second kernel without KEXEC being enabled.

o I don't see sh setup code parsing the command line for
  elfcorehdr_addr.  I am wondering how does vmcore interface work on sh.
  Anyway, I am atleast defining elfcoredhr_addr so that compilation is not
  broken on sh.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Simon Horman <horms@verge.net.au>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Nick Piggin
5344b7e648 vmstat: mlocked pages statistics
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.

[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:31 -07:00
Lee Schermerhorn
7b854121eb Unevictable LRU Page Statistics
Report unevictable pages per zone and system wide.

Kosaki Motohiro added support for memory controller unevictable
statistics.

[riel@redhat.com: fix printk in show_free_areas()]
[akpm@linux-foundation.org: fix units in /proc/vmstats]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:26 -07:00
Rik van Riel
4f98a2fee8 vmscan: split LRU lists into anon & file sets
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon").  The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists.  The big
policy changes are in separate patches.

[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:25 -07:00
Thomas Gleixner
c465a76af6 Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus 2008-10-20 13:14:06 +02:00
Alexey Dobriyan
f40cbaa5b0 proc: move sysrq-trigger out of fs/proc/
Move it into sysrq.c, along with the rest of the sysrq implementation.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:47 -07:00
Thomas Gleixner
2be3b52a57 proc: fixup irq iterator
There is no need for irq_desc here. Even for sparse_irq we can
handle this clever in for_each_irq_nr().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-16 16:53:30 +02:00
Thomas Gleixner
2cc21ef843 genirq: remove sparse irq code
This code is not ready, but we need to rip it out instead of rebasing
as we would lose the APIC/IO_APIC unification otherwise.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-16 16:53:15 +02:00
Yinghai Lu
6d50bc2683 x86: use 28 bits irq NR for pci msi/msix and ht
also print out irq no in /proc/interrups and /proc/stat in hex, so could
tell bus/dev/func.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:52 +02:00
Yinghai Lu
52b17329d6 x86_64: make /proc/interrupts work with dyn irq_desc
loop with irq_desc list

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:51 +02:00
Yinghai Lu
c7fb03a475 irq, fs/proc: replace loop with nr_irqs for proc/stat
Replace another nr_irqs loop to avoid the allocation of all sparse
irq entries - use for_each_irq_desc instead.

v2: make sure arch without GENERIC_HARDIRQS works too

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:33 +02:00
Yinghai Lu
7f95ec9e4c x86: move kstat_irqs from kstat to irq_desc
based on Eric's patch ...

together mold it with dyn_array for irq_desc, will allcate kstat_irqs for
nr_irq_desc alltogether if needed. -- at that point nr_cpus is known already.

v2: make sure system without generic_hardirqs works they don't have irq_desc
v3: fix merging
v4: [mingo@elte.hu] fix typo

[ mingo@elte.hu ] irq: build fix

fix:

 arch/x86/xen/spinlock.c: In function 'xen_spin_lock_slow':
 arch/x86/xen/spinlock.c:90: error: 'struct kernel_stat' has no member named 'irqs'

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:32 +02:00
Yinghai Lu
da27c118eb fs/proc: use nr_irqs
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:07 +02:00
Linus Torvalds
8acd3a60bc Merge branch 'for-2.6.28' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits)
  svcrdma: Fix IRD/ORD polarity
  svcrdma: Update svc_rdma_send_error to use DMA LKEY
  svcrdma: Modify the RPC reply path to use FRMR when available
  svcrdma: Modify the RPC recv path to use FRMR when available
  svcrdma: Add support to svc_rdma_send to handle chained WR
  svcrdma: Modify post recv path to use local dma key
  svcrdma: Add a service to register a Fast Reg MR with the device
  svcrdma: Query device for Fast Reg support during connection setup
  svcrdma: Add FRMR get/put services
  NLM: Remove unused argument from svc_addsock() function
  NLM: Remove "proto" argument from lockd_up()
  NLM: Always start both UDP and TCP listeners
  lockd: Remove unused fields in the nlm_reboot structure
  lockd: Add helper to sanity check incoming NOTIFY requests
  lockd: change nlmclnt_grant() to take a "struct sockaddr *"
  lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses
  lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET
  lockd: Support non-AF_INET addresses in nlm_lookup_host()
  NLM: Convert nlm_lookup_host() to use a single argument
  svcrdma: Add Fast Reg MR Data Types
  ...
2008-10-14 12:31:14 -07:00
Alexey Dobriyan
3bbfe05967 proc: remove kernel.maps_protect
After commit 831830b5a2 aka
"restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace"
sysctl stopped being relevant because commit moved security checks from ->show
time to ->start time (mm_for_maps()).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Kees Cook <kees.cook@canonical.com>
2008-10-10 04:24:51 +04:00
Alexey Dobriyan
45acb8db06 proc: remove now unneeded ADDBUF macro
After local seq_file conversion it was forgotten.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:58 +04:00
Kees Cook
4783072308 [PATCH] proc: show personality via /proc/pid/personality
Make process personality flags visible in /proc.  Since a process's
personality is potentially sensitive (e.g. READ_IMPLIES_EXEC), make this
file only readable by the process owner.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:57 +04:00
Lai Jiangshan
a6bebbc87a [PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock()
lock_task_sighand() make sure task->sighand is being protected,
so we do not need rcu_read_lock().
[ exec() will get task->sighand->siglock before change task->sighand! ]

But code using rcu_read_lock() _just_ to protect lock_task_sighand()
only appear in procfs. (and some code in procfs use lock_task_sighand()
without such redundant protection.)

Other subsystem may put lock_task_sighand() into rcu_read_lock()
critical region, but these rcu_read_lock() are used for protecting
"for_each_process()", "find_task_by_vpid()" etc. , not for protecting
lock_task_sighand().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
[ok from Oleg]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:57 +04:00
Alexey Dobriyan
53167a3ef2 proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:57 +04:00
Adrian Bunk
81324364b7 proc: make grab_header() static
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:56 +04:00
Alexey Dobriyan
a70973c214 proc: remove unused get_dma_list()
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:56 +04:00
Alexey Dobriyan
a04f4de641 proc: remove dummy vmcore_open()
Empty ->open is equivalent to always succeeding ->open.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:55 +04:00
Alexey Dobriyan
e1675231ce proc: proc_sys_root tweak
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:55 +04:00
Alexey Dobriyan
300b994b74 proc: fix return value of proc_reg_open() in "too late" case
If ->open() wasn't called, returning 0 is misleading and, theoretically,
oopsable:
1) remove_proc_entry clears ->proc_fops, drops lock,
2) ->open "succeeds",
3) ->release oopses, because it assumes ->open was called (single_release()).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-10 04:18:54 +04:00
Thomas Petazzoni
bfcd17a6c5 Configure out file locking features
This patch adds the CONFIG_FILE_LOCKING option which allows to remove
support for advisory locks. With this patch enabled, the flock()
system call, the F_GETLK, F_SETLK and F_SETLKW operations of fcntl()
and NFS support are disabled. These features are not necessarly needed
on embedded systems. It allows to save ~11 Kb of kernel code and data:

   text          data     bss     dec     hex filename
1125436        118764  212992 1457192  163c28 vmlinux.old
1114299        118564  212992 1445855  160fdf vmlinux
 -11137    -200       0  -11337   -2C49 +/-

This patch has originally been written by Matt Mackall
<mpm@selenic.com>, and is part of the Linux Tiny project.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: matthew@wil.cx
Cc: linux-fsdevel@vger.kernel.org
Cc: mpm@selenic.com
Cc: akpm@linux-foundation.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-09-29 17:56:57 -04:00
Frank Mayhar
f06febc96b timers: fix itimer/many thread hang
Overview

This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling.  It was put together
with the help of Roland McGrath, the owner and original writer of this code.

The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads.  It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.

This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

Code Changes

This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine.  (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.)  To do this, at each tick we now update fields in
signal_struct as well as task_struct.  The run_posix_cpu_timers() function
uses those fields to make its decisions.

We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:

struct task_cputime {
	cputime_t utime;
	cputime_t stime;
	unsigned long long sum_exec_runtime;
};

This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels.  For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:

struct thread_group_cputime {
	struct task_cputime totals;
};

struct thread_group_cputime {
	struct task_cputime *totals;
};

We also add a new task_cputime substructure directly to signal_struct, to
cache the earliest expiration of process-wide timers, and task_cputime also
replaces the it_*_expires fields of task_struct (used for earliest expiration
of thread timers).  The "thread_group_cputime" structure contains process-wide
timers that are updated via account_user_time() and friends.  In the non-SMP
case the structure is a simple aggregator; unfortunately in the SMP case that
simplicity was not achievable due to cache-line contention between CPUs (in
one measured case performance was actually _worse_ on a 16-cpu system than
the same test on a 4-cpu system, due to this contention).  For SMP, the
thread_group_cputime counters are maintained as a per-cpu structure allocated
using alloc_percpu().  The timer functions update only the timer field in
the structure corresponding to the running CPU, obtained using per_cpu_ptr().

We define a set of inline functions in sched.h that we use to maintain the
thread_group_cputime structure and hide the differences between UP and SMP
implementations from the rest of the kernel.  The thread_group_cputime_init()
function initializes the thread_group_cputime structure for the given task.
The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
in the per-cpu structures and fields.  The thread_group_cputime_free()
function, also a no-op for UP, in SMP frees the per-cpu structures.  The
thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
thread_group_cputime_alloc() if the per-cpu structures haven't yet been
allocated.  The thread_group_cputime() function fills the task_cputime
structure it is passed with the contents of the thread_group_cputime fields;
in UP it's that simple but in SMP it must also safely check that tsk->signal
is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
if so, sums the per-cpu values for each online CPU.  Finally, the three
functions account_group_user_time(), account_group_system_time() and
account_group_exec_runtime() are used by timer functions to update the
respective fields of the thread_group_cputime structure.

Non-SMP operation is trivial and will not be mentioned further.

The per-cpu structure is always allocated when a task creates its first new
thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
It is freed at process exit via a call to thread_group_cputime_free() from
cleanup_signal().

All functions that formerly summed utime/stime/sum_sched_runtime values from
from all threads in the thread group now use thread_group_cputime() to
snapshot the values in the thread_group_cputime structure or the values in
the task structure itself if the per-cpu structure hasn't been allocated.

Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
The run_posix_cpu_timers() function has been split into a fast path and a
slow path; the former safely checks whether there are any expired thread
timers and, if not, just returns, while the slow path does the heavy lifting.
With the dedicated thread group fields, timers are no longer "rebalanced" and
the process_timer_rebalance() function and related code has gone away.  All
summing loops are gone and all code that used them now uses the
thread_group_cputime() inline.  When process-wide timers are set, the new
task_cputime structure in signal_struct is used to cache the earliest
expiration; this is checked in the fast path.

Performance

The fix appears not to add significant overhead to existing operations.  It
generally performs the same as the current code except in two cases, one in
which it performs slightly worse (Case 5 below) and one in which it performs
very significantly better (Case 2 below).  Overall it's a wash except in those
two cases.

I've since done somewhat more involved testing on a dual-core Opteron system.

Case 1: With no itimer running, for a test with 100,000 threads, the fixed
	kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
	all of which was spent in the system.  There were twice as many
	voluntary context switches with the fix as without it.

Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
	an unmodified kernel can handle), the fixed kernel ran the test in
	eight percent of the time (5.8 seconds as opposed to 70 seconds) and
	had better tick accuracy (.012 seconds per tick as opposed to .023
	seconds per tick).

Case 3: A 4000-thread test with an initial timer tick of .01 second and an
	interval of 10,000 seconds (i.e. a timer that ticks only once) had
	very nearly the same performance in both cases:  6.3 seconds elapsed
	for the fixed kernel versus 5.5 seconds for the unfixed kernel.

With fewer threads (eight in these tests), the Case 1 test ran in essentially
the same time on both the modified and unmodified kernels (5.2 seconds versus
5.8 seconds).  The Case 2 test ran in about the same time as well, 5.9 seconds
versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
tick versus .025 seconds per tick for the unmodified kernel.

Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
	running), the modified kernel was very slightly favored in that while
	it killed the process in 19.997 seconds of CPU time (5.002 seconds of
	wall time), only .003 seconds of that was system time, the rest was
	user time.  The unmodified kernel killed the process in 20.001 seconds
	of CPU (5.014 seconds of wall time) of which .016 seconds was system
	time.  Really, though, the results were too close to call.  The results
	were essentially the same with no itimer running.

Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
	(where the hard limit would never be reached) and an itimer running,
	the modified kernel exhibited worse tick accuracy than the unmodified
	kernel: .050 seconds/tick versus .028 seconds/tick.  Otherwise,
	performance was almost indistinguishable.  With no itimer running this
	test exhibited virtually identical behavior and times in both cases.

In times past I did some limited performance testing.  those results are below.

On a four-cpu Opteron system without this fix, a sixteen-thread test executed
in 3569.991 seconds, of which user was 3568.435s and system was 1.556s.  On
the same system with the fix, user and elapsed time were about the same, but
system time dropped to 0.007 seconds.  Performance with eight, four and one
thread were comparable.  Interestingly, the timer ticks with the fix seemed
more accurate:  The sixteen-thread test with the fix received 149543 ticks
for 0.024 seconds per tick, while the same test without the fix received 58720
for 0.061 seconds per tick.  Both cases were configured for an interval of
0.01 seconds.  Again, the other tests were comparable.  Each thread in this
test computed the primes up to 25,000,000.

I also did a test with a large number of threads, 100,000 threads, which is
impossible without the fix.  In this case each thread computed the primes only
up to 10,000 (to make the runtime manageable).  System time dominated, at
1546.968 seconds out of a total 2176.906 seconds (giving a user time of
629.938s).  It received 147651 ticks for 0.015 seconds per tick, still quite
accurate.  There is obviously no comparable test without the fix.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-14 16:25:35 +02:00
Hugh Dickins
d7a3e4959c mm: ifdef Quicklists in /proc/meminfo
A "Quicklists:          0 kB" line has just started appearing in
/proc/meminfo, but most architectures (including x86) don't have
them configured, so #ifdef it, like the highmem lines.

And those architectures which do have quicklists configured are
using them for page tables: so let's place it next to PageTables.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13 14:41:51 -07:00
Alexey Dobriyan
665020c35e proc: more debugging for "already registered" case
Print parent directory name as well.

The aim is to catch non-creation of parent directory when proc_mkdir will
return NULL and all subsequent registrations go directly in /proc instead
of intended directory.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Fixed insane printk string while at it.  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13 14:41:50 -07:00
Balbir Singh
49048622ea sched: fix process time monotonicity
Spencer reported a problem where utime and stime were going negative despite
the fixes in commit b27f03d4bd. The suspected
reason for the problem is that signal_struct maintains it's own utime and
stime (of exited tasks), these are not updated using the new task_utime()
routine, hence sig->utime can go backwards and cause the same problem
to occur (sig->utime, adds tsk->utime and not task_utime()). This patch
fixes the problem

TODO: using max(task->prev_utime, derived utime) works for now, but a more
generic solution is to implement cputime_max() and use the cputime_gt()
function for comparison.

Reported-by: spencer@bluehost.com
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-05 18:14:35 +02:00
KOSAKI Motohiro
4b8561521d mm: show quicklist usage in /proc/meminfo
Quicklists can consume several GB of memory.  We should provide a means of
monitoring this.

After this patch is applied, /proc/meminfo will output the following:

% cat /proc/meminfo

MemTotal:      7715392 kB
MemFree:       5401600 kB
Buffers:         80384 kB
Cached:         300800 kB
SwapCached:          0 kB
Active:         235584 kB
Inactive:       262656 kB
SwapTotal:     2031488 kB
SwapFree:      2031488 kB
Dirty:            3520 kB
Writeback:           0 kB
AnonPages:      117696 kB
Mapped:          38528 kB
Slab:          1589952 kB
SReclaimable:    23104 kB
SUnreclaim:    1566848 kB
PageTables:      14656 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
WritebackTmp:        0 kB
CommitLimit:   5889152 kB
Committed_AS:   393152 kB
VmallocTotal: 17592177655808 kB
VmallocUsed:     29056 kB
VmallocChunk: 17592177626432 kB
Quicklists:     130944 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize:    262144 kB

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02 19:21:38 -07:00
Alexey Dobriyan
cc99609917 [PATCH] proc: inode number fixlet
Ouch, if number taken from IDA is too big, the intent was to signal an
error, not check for overflow and still do overflowing addition.

One still needs 2^28 proc entries to notice this.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-25 01:18:03 -04:00
Clement Calmels
1804dc6e14 /proc/self/maps doesn't display the real file offset
This addresses

	http://bugzilla.kernel.org/show_bug.cgi?id=11318

In function show_map (file: fs/proc/task_mmu.c), if vma->vm_pgoff > 2^20
than (vma->vm_pgoff << PAGE_SIZE) is greater than 2^32 (with PAGE_SIZE
equal to 4096 (i.e.  2^12).  The next seq_printf use an unsigned long for
the conversion of (vma->vm_pgoff << PAGE_SIZE), as a result the offset
value displayed in /proc/self/maps is truncated if the page offset is
greater than 2^20.

A test that shows this issue:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define PAGE_SIZE (getpagesize())

#if __i386__
#   define U64_STR "%llx"
#elif __x86_64
#   define U64_STR "%lx"
#else
#   error "Architecture Unsupported"
#endif

int main(int argc, char *argv[])
{
	int fd;
	char *addr;
	off64_t offset = 0x10000000;
	char *filename = "/dev/zero";

	fd = open(filename, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	offset *= 0x10;
	printf("offset = " U64_STR "\n", offset);

	addr = (char*)mmap64(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd,
			     offset);
	if ((void*)addr == MAP_FAILED) {
		perror("mmap64");
		return 1;
	}

	{
		FILE *fmaps;
		char *line = NULL;
		size_t len = 0;
		ssize_t read;
		size_t filename_len = strlen(filename);

		fmaps = fopen("/proc/self/maps", "r");
		if (!fmaps) {
			perror("fopen");
			return 1;
		}
		while ((read = getline(&line, &len, fmaps)) != -1) {
			if ((read > filename_len + 1)
			    && (strncmp(&line[read - filename_len - 1], filename, filename_len) == 0))
				printf("%s", line);
		}

		if (line)
			free(line);

		fclose(fmaps);
	}

	close(fd);
	return 0;
}

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Clement Calmels <cboulte@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-20 15:40:30 -07:00
Alexander Beregalov
7c44319dc6 proc: fix warnings
proc: fix warnings

 fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'u64'
 fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 4 has type 'u64'
 fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 5 has type 'u64'
 fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 6 has type 'u64'
 fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 7 has type 'u64'
 fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 8 has type 'u64'
 fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 9 has type 'u64'

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Acked-by: Andrea Righi <righi.andrea@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-05 14:33:50 -07:00
Alexey Dobriyan
9a18540915 [PATCH 2/2] proc: switch inode number allocation to IDA
proc doesn't use "associate pointer with id" feature of IDR, so switch
to IDA.

NOTE, NOTE, NOTE:
	Do not apply if release_inode_number() still mantions MAX_ID_MASK!

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01 11:25:28 -04:00
Alexey Dobriyan
67935df49d [PATCH 1/2] proc: fix inode number bogorithmetic
Id which proc gets from IDR for inode number and id which proc removes
from IDR do not match. E.g. 0x11a transforms into 0x8000011a.

Which stayed unnoticed for a long time because, surprise, idr_remove()
masks out that high bit before doing anything.

All of this due to "| ~MAX_ID_MASK" in release_inode_number().

I still don't understand how it's supposed to work, because "| ~MASK"
is not an inversion for "& MAX" operation.

So, use just one nice, working addition. Make start offset unsigned int,
while I'm at it. It's longness is not used anywhere.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01 11:25:27 -04:00
Andrea Righi
940389b8af task IO accounting: move all IO statistics in struct task_io_accounting
Simplify the code of include/linux/task_io_accounting.h.

It is also more reasonable to have all the task i/o-related statistics in a
single struct (task_io_accounting).

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-27 16:12:28 -07:00
Andrea Righi
5995477ab7 task IO accounting: improve code readability
Put all i/o statistics in struct proc_io_accounting and use inline functions to
initialize and increment statistics, removing a lot of single variable
assignments.

This also reduces the kernel size as following (with CONFIG_TASK_XACCT=y and
CONFIG_TASK_IO_ACCOUNTING=y).

    text    data     bss     dec     hex filename
   11651       0       0   11651    2d83 kernel/exit.o.before
   11619       0       0   11619    2d63 kernel/exit.o.after
   10886     132     136   11154    2b92 kernel/fork.o.before
   10758     132     136   11026    2b12 kernel/fork.o.after

 3082029  807968 4818600 8708597  84e1f5 vmlinux.o.before
 3081869  807968 4818600 8708437  84e155 vmlinux.o.after

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-27 09:58:20 -07:00
Linus Torvalds
4836e30078 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits)
  [PATCH] fix RLIM_NOFILE handling
  [PATCH] get rid of corner case in dup3() entirely
  [PATCH] remove remaining namei_{32,64}.h crap
  [PATCH] get rid of indirect users of namei.h
  [PATCH] get rid of __user_path_lookup_open
  [PATCH] f_count may wrap around
  [PATCH] dup3 fix
  [PATCH] don't pass nameidata to __ncp_lookup_validate()
  [PATCH] don't pass nameidata to gfs2_lookupi()
  [PATCH] new (local) helper: user_path_parent()
  [PATCH] sanitize __user_walk_fd() et.al.
  [PATCH] preparation to __user_walk_fd cleanup
  [PATCH] kill nameidata passing to permission(), rename to inode_permission()
  [PATCH] take noexec checks to very few callers that care
  Re: [PATCH 3/6] vfs: open_exec cleanup
  [patch 4/4] vfs: immutable inode checking cleanup
  [patch 3/4] fat: dont call notify_change
  [patch 2/4] vfs: utimes cleanup
  [patch 1/4] vfs: utimes: move owner check into inode_change_ok()
  [PATCH] vfs: use kstrdup() and check failing allocation
  ...
2008-07-26 20:23:44 -07:00
Andrea Righi
b2d002dba5 task IO accounting: correctly account threads IO statistics
Oleg Nesterov points out that we should check that the task is still alive
before we iterate over the threads.  This patch includes a fixup for this.

Also simplify do_io_accounting() implementation.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 20:16:47 -07:00
Al Viro
e6305c43ed [PATCH] sanitize ->permission() prototype
* kill nameidata * argument; map the 3 bits in ->flags anybody cares
  about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
  MAY_... found in mask.

The obvious next target in that direction is permission(9)

folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-07-26 20:53:14 -04:00
Al Viro
9043476f72 [PATCH] sanitize proc_sysctl
* keep references to ctl_table_head and ctl_table in /proc/sys inodes
* grab the former during operations, use the latter for access to
  entry if that succeeds
* have ->d_compare() check if table should be seen for one who does lookup;
  that allows us to avoid flipping inodes - if we have the same name resolve
  to different things, we'll just keep several dentries and ->d_compare()
  will reject the wrong ones.
* have ->lookup() and ->readdir() scan the table of our inode first, then
  walk all ctl_table_header and scan ->attached_by for those that are
  attached to our directory.
* implement ->getattr().
* get rid of insane amounts of tree-walking
* get rid of the need to know dentry in ->permission() and of the contortions
  induced by that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-07-26 20:53:12 -04:00
Roland McGrath
ebcb67341f /proc/PID/syscall
This adds /proc/PID/syscall and /proc/PID/task/TID/syscall magic files.
These use task_current_syscall() to show the task's current system call
number and argument registers, stack pointer and PC.  For a task blocked
but not in a syscall, the file shows "-1" in place of the syscall number,
followed by only the SP and PC.  For a task that's not blocked, it shows
"running".

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:10 -07:00
Roland McGrath
0d094efeb1 tracehook: tracehook_tracer_task
This adds the tracehook_tracer_task() hook to consolidate all forms of
"Who is using ptrace on me?" logic.  This is used for "TracerPid:" in
/proc and for permission checks.  We also clean up the selinux code the
called an identical accessor.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:08 -07:00
Arjan van de Ven
267e2a9c71 Use WARN() in fs/proc/
Use WARN() instead of a printk+WARN_ON() pair; this way the message
becomes part of the warning section for better reporting/collection.
This way, the entire if() {} section can collapse into the WARN() as well.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:08 -07:00
Alexey Dobriyan
51cc50685a SL*B: drop kmem cache argument from constructor
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres.  Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.

Non-trivial places are:
	arch/powerpc/mm/init_64.c
	arch/powerpc/mm/hugetlbpage.c

This is flag day, yes.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:07 -07:00
Andrea Righi
297c5d9263 task IO accounting: provide distinct tgid/tid I/O statistics
Report per-thread I/O statistics in /proc/pid/task/tid/io and aggregate
parent I/O statistics in /proc/pid/io.  This approach follows the same
model used to account per-process and per-thread CPU times.

As a practial application, this allows for example to quickly find the top
I/O consumer when a process spawns many child threads that perform the
actual I/O work, because the aggregated I/O statistics can always be found
in /proc/pid/io.

[ Oleg Nesterov points out that we should check that the task is still
  alive before we iterate over the threads, but also says that we can do
  that fixup on top of this later.  - Linus ]

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Cc: Matt Heaton <matt@hostmonster.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Acked-by-with-comments: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:47 -07:00
Alexey Dobriyan
6eedf8d30d proc: move Kconfig to fs/proc/Kconfig
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:45 -07:00
Alexey Dobriyan
a9bd4a3e07 proc: remove pathetic remount code
MS_RMT_MASK will unmask changes in do_remount_sb() anyway.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:45 -07:00
Alexey Dobriyan
881adb8535 proc: always do ->release
Current two-stage scheme of removing PDE emphasizes one bug in proc:

		open
				rmmod
				remove_proc_entry
		close

->release won't be called because ->proc_fops were cleared.  In simple
cases it's small memory leak.

For every ->open, ->release has to be done.  List of openers is introduced
which is traversed at remove_proc_entry() if neeeded.

Discussions with Al long ago (sigh).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:44 -07:00
Adrian Bunk
6e644c3126 move proc_kmsg_operations to fs/proc/internal.h
This patch moves the extern of struct proc_kmsg_operations to
fs/proc/internal.h and adds an #include "internal.h" to fs/proc/kmsg.c
so that the latter sees the former.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:44 -07:00
Edgar E. Iglesias
79885b2277 elf: use ELF_CORE_EFLAGS for kcore ELF header flags
ELF_CORE_EFLAGS is already used by the binfmt_elf coredumper to set correct
arch specific ELF header flags on coredumps.  Use it for kcore dumps as well.
At the moment, this affects the CRIS and the H8300 arch.

Signed-off-by: Edgar E. Iglesias <edgar@axis.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:42 -07:00
Eric Dumazet
a47a126ad5 vmallocinfo: add NUMA information
Christoph recently added /proc/vmallocinfo file to get information about
vmalloc allocations.

This patch adds NUMA specific information, giving number of pages
allocated on each memory node.

This should help to check that vmalloc() is able to respect NUMA policies.

Example of output on a four nodes machine (one cpu per node)

1) network hash tables are evenly spreaded on four nodes (OK) (Same
   point for inodes and dentries hash tables)

2) iptables tables (x_tables) are correctly allocated on each cpu node
   (OK).

3) sys_swapon() allocates its memory from one node only.

4) each loaded module is using memory on one node.

Sysadmins could tune their setup to change points 3) and 4) if necessary.

grep "pages="  /proc/vmallocinfo
0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
0xffffc2000031a000-0xffffc2000031d000   12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1
0xffffc2000031f000-0xffffc2000032b000   49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
0xffffc2000033e000-0xffffc20000341000   12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2
0xffffc20000341000-0xffffc20000344000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2
0xffffc20000344000-0xffffc20000347000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2
0xffffc20000347000-0xffffc2000034a000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2
0xffffc2000034a000-0xffffc2000034d000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2
0xffffc20004381000-0xffffc20004402000  528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32
0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256
0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
0xffffc20004904000-0xffffc20004bec000 3047424 sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743
0xffffffffa0000000-0xffffffffa000f000   61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14
0xffffffffa000f000-0xffffffffa0014000   20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4
0xffffffffa0014000-0xffffffffa0017000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2
0xffffffffa0017000-0xffffffffa0022000   45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10
0xffffffffa0022000-0xffffffffa0028000   24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5
0xffffffffa0028000-0xffffffffa0050000  163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39
0xffffffffa0050000-0xffffffffa0052000    8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1
0xffffffffa0052000-0xffffffffa0056000   16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3
0xffffffffa0056000-0xffffffffa0081000  176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42
0xffffffffa0081000-0xffffffffa00ae000  184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44
0xffffffffa00ae000-0xffffffffa00b1000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
0xffffffffa00b1000-0xffffffffa00b9000   32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7
0xffffffffa00b9000-0xffffffffa00c4000   45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10
0xffffffffa00c6000-0xffffffffa00e0000  106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25
0xffffffffa00e0000-0xffffffffa00f1000   69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16
0xffffffffa00f1000-0xffffffffa00f4000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
0xffffffffa00f4000-0xffffffffa00f7000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:17 -07:00
Adrian Bunk
c748e1340e mm/vmstat.c: proper externs
This patch adds proper extern declarations for five variables in
include/linux/vmstat.h

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:14 -07:00
Linus Torvalds
c010b2f76c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits)
  ipw2200: Call netif_*_queue() interfaces properly.
  netxen: Needs to include linux/vmalloc.h
  [netdrvr] atl1d: fix !CONFIG_PM build
  r6040: rework init_one error handling
  r6040: bump release number to 0.18
  r6040: handle RX fifo full and no descriptor interrupts
  r6040: change the default waiting time
  r6040: use definitions for magic values in descriptor status
  r6040: completely rework the RX path
  r6040: call napi_disable when puting down the interface and set lp->dev accordingly.
  mv643xx_eth: fix NETPOLL build
  r6040: rework the RX buffers allocation routine
  r6040: fix scheduling while atomic in r6040_tx_timeout
  r6040: fix null pointer access and tx timeouts
  r6040: prefix all functions with r6040
  rndis_host: support WM6 devices as modems
  at91_ether: use netstats in net_device structure
  sfc: Create one RX queue and interrupt per CPU package by default
  sfc: Use a separate workqueue for resets
  sfc: I2C adapter initialisation fixes
  ...
2008-07-22 19:09:51 -07:00
Adrian Bunk
8086cd451f netns: make get_proc_net() static
get_proc_net() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:19:19 -07:00
Alexey Dobriyan
ee1e6ab605 proc: fix /proc/*/pagemap some more
struct pagemap_walk was placed on stack, some hooks are initialized, the
rest (->pgd_entry, ->pud_entry, ->pte_entry) are valid but junk.

Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-22 09:59:41 -07:00
Linus Torvalds
db6d8c7a40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits)
  iucv: Fix bad merging.
  net_sched: Add size table for qdiscs
  net_sched: Add accessor function for packet length for qdiscs
  net_sched: Add qdisc_enqueue wrapper
  highmem: Export totalhigh_pages.
  ipv6 mcast: Omit redundant address family checks in ip6_mc_source().
  net: Use standard structures for generic socket address structures.
  ipv6 netns: Make several "global" sysctl variables namespace aware.
  netns: Use net_eq() to compare net-namespaces for optimization.
  ipv6: remove unused macros from net/ipv6.h
  ipv6: remove unused parameter from ip6_ra_control
  tcp: fix kernel panic with listening_get_next
  tcp: Remove redundant checks when setting eff_sacks
  tcp: options clean up
  tcp: Fix MD5 signatures for non-linear skbs
  sctp: Update sctp global memory limit allocations.
  sctp: remove unnecessary byteshifting, calculate directly in big-endian
  sctp: Allow only 1 listening socket with SO_REUSEADDR
  sctp: Do not leak memory on multiple listen() calls
  sctp: Support ipv6only AF_INET6 sockets.
  ...
2008-07-20 17:43:29 -07:00