Commit Graph

565 Commits

Author SHA1 Message Date
Christoph Lameter
1e275d406b [PATCH] page migration: Fix MPOL_INTERLEAVE behavior for migration via mbind()
migrate_pages_to() allocates a list of new pages on the intended target
node or with the intended policy and then uses the list of new pages as
targets for the migration of a list of pages out of place.

When the pages are allocated it is not clear which of the out of place
pages will be moved to the new pages.  So we cannot specify an address as
needed by alloc_page_vma().  This causes problem for MPOL_INTERLEAVE which
will currently allocate the pages on the first node of the set.  If mbind
is used with vma that has the policy of MPOL_INTERLEAVE then the
interleaving of pages may be destroyed.

This patch fixes that by generating a fake address for each alloc_page_vma
which will result is a distribution of pages as prescribed by
MPOL_INTERLEAVE.

Lee also noted that the sequence of nodes for the new pages seems to be
inverted.  So we also invert the way the lists of pages for migration are
build.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Looks-ok-to: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-24 14:31:38 -08:00
Hugh Dickins
b00dc3ad74 [PATCH] tmpfs: fix mount mpol nodelist parsing
I've been dissatisfied with the mpol_nodelist mount option which was
added to tmpfs earlier in -rc.  Replace it by mpol=policy:nodelist.

And it was broken: a nodelist is a comma-separated list of numbers and
ranges; the mount options are a comma-separated list of token=values.
Whoops, blindly strsep'ing on commas doesn't work so well: since we've
no numeric tokens, and unlikely to add them, use that to distinguish.

Move the mpol= parsing to shmem_parse_mpol under CONFIG_NUMA, reject
all its options as invalid if not NUMA.  /proc shows MPOL_PREFERRED
as "prefer", so use that name for the policy instead of "preferred".

Enforce that mpol=default has no nodelist; that mpol=prefer has one
node only; that mpol=bind has a nodelist; but let mpol=interleave use
node_online_map if no nodelist given.  Describe this in tmpfs.txt.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Robin Holt <holt@sgi.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-21 17:10:15 -08:00
Alexey Dobriyan
fcab6f3513 [PATCH] mm/mempolicy.c: fix 'if ();' typo
[akpm; it happens that the code was still correct, only inefficient ]

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:11 -08:00
Luke Yang
7a9166e3b0 [PATCH] Fix undefined symbols for nommu architecture
Signed-off-by: Luke Yang <luke.adi@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:11 -08:00
Andi Kleen
a9c930bac1 [PATCH] Fix units in mbind check
maxnode is a bit index and can't be directly compared against a byte length
like PAGE_SIZE

Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:10 -08:00
Christoph Lameter
9b0f8b040a [PATCH] Terminate process that fails on a constrained allocation
Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints).  If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely going to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior.  The process may be killed but then the
sysadmin or developer can investigate the situation.  The solution is
similar to what we do when running out of hugepages.

This patch adds a check before we kill processes.  At that point
performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes.  If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should kill the
current process.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:09 -08:00
Kurt Garloff
9827b781f2 [PATCH] OOM kill: children accounting
In the badness() calculation, there's currently this piece of code:

        /*
         * Processes which fork a lot of child processes are likely
         * a good choice. We add the vmsize of the children if they
         * have an own mm. This prevents forking servers to flood the
         * machine with an endless amount of children
         */
        list_for_each(tsk, &p->children) {
                struct task_struct *chld;
                chld = list_entry(tsk, struct task_struct, sibling);
                if (chld->mm = p->mm && chld->mm)
                        points += chld->mm->total_vm;
        }

The intention is clear: If some server (apache) keeps spawning new children
and we run OOM, we want to kill the father rather than picking a child.

This -- to some degree -- also helps a bit with getting fork bombs under
control, though I'd consider this a desirable side-effect rather than a
feature.

There's one problem with this: No matter how many or few children there are,
if just one of them misbehaves, and all others (including the father) do
everything right, we still always kill the whole family.  This hits in real
life; whether it's javascript in konqueror resulting in kdeinit (and thus the
whole KDE session) being hit or just a classical server that spawns children.

Sidenote: The killer does kill all direct children as well, not only the
selected father, see oom_kill_process().

The idea in attached patch is that we do want to account the memory
consumption of the (direct) children to the father -- however not fully.
This maintains the property that fathers with too many children will still
very likely be picked, whereas a single misbehaving child has the chance to
be picked by the OOM killer.

In the patch I account only half (rounded up) of the children's vm_size to
the parent.  This means that if one child eats more mem than the rest of
the family, it will be picked, otherwise it's still the father and thus the
whole family that gets selected.

This is heuristics -- we could debate whether accounting for a fourth would
be better than for half of it.  Or -- if people would consider it worth the
trouble -- make it a sysctl.  For now I sticked to accounting for half,
which should IMHO be a significant improvement.

The patch does one more thing: As users tend to be irritated by the choice
of killed processes (mainly because the children are killed first, despite
some of them having a very low OOM score), I added some more output: The
selected (father) process will be reported first and it's oom_score printed
to syslog.

Description:

Only account for half of children's vm size in oom score calculation

This should still give the parent enough point in case of fork bombs.  If
any child however has more than 50% of the vm size of all children
together, it'll get a higher score and be elected.

This patch also makes the kernel display the oom_score.

Signed-off-by: Kurt Garloff <garloff@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:09 -08:00
Chris Wright
636f13c174 [PATCH] sys_mbind sanity checking
Make sure maxnodes is safe size before calculating nlongs in
get_nodes().

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 14:09:22 -08:00
Linus Torvalds
4cf808eb44 [PATCH] Handle holes in node mask in node fallback list setup
Change the find_next_best_node algorithm to correctly skip
over holes in the node online mask. Previously it would not handle
missing nodes correctly and cause crashes at boot.

[Written by Linus, tested by AK]

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 13:27:06 -08:00
Andi Kleen
dd942ae331 [PATCH] Handle all and empty zones when setting up custom zonelists for mbind
The memory allocator doesn't like empty zones (which have an
uninitialized freelist), so a x86-64 system with a node fully
in GFP_DMA32 only would crash on mbind.

Fix that up by putting all possible zones as fallback into the zonelist
and skipping the empty ones.

In fact the code always enough allocated space for all zones,
but only used it for the highest. This change just uses all the
memory that was allocated before.

This should work fine for now, but whoever implements node hot removal
needs to fix this somewhere else too (or make sure zone datastructures
by itself never go away, only their memory)

Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 08:18:14 -08:00
Andi Kleen
a62eaf151d [PATCH] x86_64: Add boot option to disable randomized mappings and cleanup
AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution
installation it's easiest if it's a boot time option.

Also I moved the variable to a more appropiate place and make
it independent from sysctl

And marked __read_mostly which it is.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 08:00:40 -08:00
Michael S. Tsirkin
f822566165 [PATCH] madvise MADV_DONTFORK/MADV_DOFORK
Currently, copy-on-write may change the physical address of a page even if the
user requested that the page is pinned in memory (either by mlock or by
get_user_pages).  This happens if the process forks meanwhile, and the parent
writes to that page.  As a result, the page is orphaned: in case of
get_user_pages, the application will never see any data hardware DMA's into
this page after the COW.  In case of mlock'd memory, the parent is not getting
the realtime/security benefits of mlock.

In particular, this affects the Infiniband modules which do DMA from and into
user pages all the time.

This patch adds madvise options to control whether memory range is inherited
across fork.  Useful e.g.  for when hardware is doing DMA from/into these
pages.  Could also be useful to an application wanting to speed up its forks
by cutting large areas out of consideration.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:34 -08:00
Hugh Dickins
d98c7a0984 [PATCH] compound page: default destructor
Somehow I imagined that calling a NULL destructor would free a compound page
rather than oopsing.  No, we must supply a default destructor, __free_pages_ok
using the order noted by prep_compound_page.  hugetlb can still replace this
as before with its own free_huge_page pointer.

The case that needs this is not common: rarely does put_compound_page's
put_page_testzero bring the count down to 0.  But if get_user_pages is applied
to some part of a compound page, without immediate release (e.g.  AIO or
Infiniband), then it's possible for its put_page to come after the containing
vma has been unmapped and the driver done its free_pages.

That's just the kind of case compound pages are supposed to be guarding
against (but Nick points out, nor did PageReserved handle this right).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:33 -08:00
Hugh Dickins
41d78ba550 [PATCH] compound page: use page[1].lru
If a compound page has its own put_page_testzero destructor (the only current
example is free_huge_page), that is noted in page[1].mapping of the compound
page.  But that's rather a poor place to keep it: functions which call
set_page_dirty_lock after get_user_pages (e.g.  Infiniband's
__ib_umem_release) ought to be checking first, otherwise set_page_dirty is
liable to crash on what's not the address of a struct address_space.

And now I'm about to make that worse: it turns out that every compound page
needs a destructor, so we can no longer rely on hugetlb pages going their own
special way, to avoid further problems of page->mapping reuse.  For example,
not many people know that: on 50% of i386 -Os builds, the first tail page of a
compound page purports to be PageAnon (when its destructor has an odd
address), which surprises page_add_file_rmap.

Keep the compound page destructor in page[1].lru.next instead.  And to free up
the common pairing of mapping and index, also move compound page order from
index to lru.prev.  Slab reuses page->lru too: but if we ever need slab to use
compound pages, it can easily stack its use above this.

(akpm: decoded version of the above: the tail pages of a compound page now
have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]()
caller to check that they're not compund pages before doing the dirty).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:33 -08:00
Christoph Lameter
2903fb1694 [PATCH] vmscan: skip reclaim_mapped determination if we do not swap
This puts the variables and the way to get to reclaim_mapped in one block.
And allows zone_reclaim or other things to skip the determination (maybe
this whole block of code does not belong into refill_inactive_zone()?)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-11 21:41:11 -08:00
Christoph Lameter
072eaa5d9c [PATCH] vmscan: remove duplicate increment of reclaim_in_progress
shrink_zone() already increments reclaim_in_progress.  No need to do it in
balance_pgdat.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-11 21:41:11 -08:00
Christoph Lameter
80e4342601 [PATCH] zone reclaim: do not check references to a page during zone reclaim
shrink_list() and refill_inactive() check all ptes pointing to a page for
reference bits in order to decide if the page should be put on the active
list.  This is not necessary for zone_reclaim since we are only interested
in removing unmapped pages.  Skip the checks in both functions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-11 21:41:11 -08:00
Christoph Lameter
418aade459 [PATCH] Updates for page migration
This adds some additional comments in order to help others figure out how
exactly the code works.  And fix a variable name.

Also swap_page does need to ignore all reference bits when unmapping a
page.  Otherwise we may have to repeatedly unmap a frequently touched page.
So change the try_to_unmap parameter to 1.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10 08:13:13 -08:00
Ravikiran G Thirumalai
f0188f4748 [PATCH] slab: Avoid deadlock at kmem_cache_create/kmem_cache_destroy
Prevents deadlock situation between
kmem_cache_create()/kmem_cache_destory(), and kmem_cache_create() /cpu
hotplug.  The locking order probably got moved over time.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10 08:13:12 -08:00
Ingo Molnar
9934a7939e [PATCH] SLOB=y && SMP=y fix
fix CONFIG_SLOB=y (when CONFIG_SMP=y): get rid of the 'align' parameter
from its __alloc_percpu() implementation. Boot-tested on x86.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-08 07:52:58 -08:00
Nick Piggin
8519fb30e4 [PATCH] mm: compound release fix
Compound pages on SMP systems can now often be freed from pagetables via
the release_pages path.  This uses put_page_testzero which does not handle
compound pages at all.  Releasing constituent pages from process mappings
decrements their count to a large negative number and leaks the reference
at the head page - net result is a memory leak.

The problem was hidden because the debug check in put_page_testzero itself
actually did take compound pages into consideration.

Fix the bug and the debug check.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:33 -08:00
Christoph Lameter
0df420d8b6 [PATCH] hugetlbpage: return VM_FAULT_OOM on oom
Remove wrong and misleading comments.

Return VM_FAULT_OOM if the hugetlbpage fault handler cannot allocate a
page.  do_no_page will end up doing do_exit(SIGKILL).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:31 -08:00
David Gibson
a2dfef6947 [PATCH] Hugepages need clear_user_highpage() not clear_highpage()
When hugepages are newly allocated to a file in mm/hugetlb.c, we clear them
with a call to clear_highpage() on each of the subpages.  We should be
using clear_user_highpage(): on powerpc, at least, clear_highpage() doesn't
correctly mark the page as icache dirty so if the page is executed shortly
after it's possible to get strange results.

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:31 -08:00
Linus Torvalds
7a21ef6fe9 mm/slab.c (non-NUMA): Fix compile warning and clean up code
The non-NUMA case would do an unmatched "free_alien_cache()" on an alien
pointer that had never been allocated.

It might not matter from a code generation standpoint (since in the
non-NUMA case, the code doesn't actually _do_ anything), but it not only
results in a compiler warning, it's really really ugly too.

Fix the compiler warning by just having a matching dummy allocation.
That also avoids an unnecessary #ifdef in the code.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:26:38 -08:00
Ravikiran G Thirumalai
4484ebf12b [PATCH] NUMA slab locking fixes: fix cpu down and up locking
This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab
allocator.  Sonny Rao <sonny@burdell.org> reported problems sometime back on
POWER5 boxes, when the last cpu on the nodes were being offlined.  We could
not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not
being updated on cpu down.  Since that issue is now fixed, we can reproduce
Sonny's problems on x86_64 NUMA, and here is the fix.

The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go
down, the array_caches (shared, alien) and the kmem_list3 of the node were
being freed (kfree) with the kmem_list3 lock held.  If the l3 or the
array_caches were to come from the same cache being cleared, we hit on
badness.

This patch cleans up the locking in cpu_up and cpu_down path.  We cannot
really free l3 on cpu down because, there is no node offlining yet and even
though a cpu is not yet up, node local memory can be allocated for it.  So l3s
are usually allocated at keme_cache_create and destroyed at
kmem_cache_destroy.  Hence, we don't need cachep->spinlock protection to get
to the cachep->nodelist[nodeid] either.

Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4
dbench process running all the time.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:53 -08:00
Ravikiran G Thirumalai
ca3b9b9173 [PATCH] NUMA slab locking fixes: irq disabling from cahep->spinlock to l3 lock
Earlier, we had to disable on chip interrupts while taking the
cachep->spinlock because, at cache_grow, on every addition of a slab to a slab
cache, we incremented colour_next which was protected by the cachep->spinlock,
and cache_grow could occur at interrupt context.  Since, now we protect the
per-node colour_next with the node's list_lock, we do not need to disable on
chip interrupts while taking the per-cache spinlock, but we just need to
disable interrupts when taking the per-node kmem_list3 list_lock.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:53 -08:00
Ravikiran G Thirumalai
2e1217cf96 [PATCH] NUMA slab locking fixes: move color_next to l3
colour_next is used as an index to add a colouring offset to a new slab in the
cache (colour_off * colour_next).  Now with the NUMA aware slab allocator, it
makes sense to colour slabs added on the same node sequentially with
colour_next.

This patch moves the colouring index "colour_next" per-node by placing it on
kmem_list3 rather than kmem_cache.

This also helps simplify locking for CPU up and down paths.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:53 -08:00
Christoph Lameter
64b4a954b0 [PATCH] hugetlb: add comment explaining reasons for Bus Errors
I just spent some time researching a Bus Error.  Turns out that the huge
page fault handler can return VM_FAULT_SIGBUS for various conditions where
no huge page is available.

Add a note explaining the reasoning in the source.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:53 -08:00
Eric Dumazet
88a2a4ac6b [PATCH] percpu data: only iterate over possible CPUs
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
cpudata, instead of allocating memory only for possible cpus.

As a preparation for changing that, we need to convert various 0 -> NR_CPUS
loops to use for_each_cpu().

(The above only applies to users of asm-generic/percpu.h.  powerpc has gone it
alone and is presently only allocating memory for present CPUs, so it's
currently corrupting memory).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:51 -08:00
Chen, Kenneth W
00ac59adfc [PATCH] x86_64: Fix memory policy build without CONFIG_HUGETLBFS
> mm/mempolicy.c: In function `huge_zonelist':
> mm/mempolicy.c:1045: error: `HPAGE_SHIFT' undeclared (first use in this function)
> mm/mempolicy.c:1045: error: (Each undeclared identifier is reported only once
> mm/mempolicy.c:1045: error: for each function it appears in.)
> make[1]: *** [mm/mempolicy.o] Error 1

Need to wrap huge_zonelist function with CONFIG_HUGETLBFS.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-04 16:43:14 -08:00
Randy Dunlap
ee13d785ea [PATCH] slab: fix sparse warning
mm/slab.c:1522:13: error: incompatible types for operation (&)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Randy.Dunlap
a70773ddb9 [PATCH] mm/slab: add kernel-doc for one function
Fix kernel-doc for calculate_slab_order().

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg
7fd6b14130 [PATCH] slab: fix kzalloc and kstrdup caller report for CONFIG_DEBUG_SLAB
Fix kzalloc() and kstrdup() caller report for CONFIG_DEBUG_SLAB.  We must
pass the caller to __cache_alloc() instead of directly doing
__builtin_return_address(0) there; otherwise kzalloc() and kstrdup() are
reported as the allocation site instead of the real one.

Thanks to Valdis Kletnieks for reporting the problem and Steven Rostedt for
the original idea.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Andrew Morton
b958f7d9f3 [PATCH] dump_stack() in oom handler
Sometimes it's nice to know who's calling.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg
343e0d7a93 [PATCH] slab: replace kmem_cache_t with struct kmem_cache
Replace uses of kmem_cache_t with proper struct kmem_cache in mm/slab.c.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg
9a2dba4b49 [PATCH] slab: rename ac_data to cpu_cache_get
Rename the ac_data() function to more descriptive cpu_cache_get().

Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg
6ed5eb2211 [PATCH] slab: extract virt_to_{cache|slab}
Introduce virt_to_cache() and virt_to_slab() functions to reduce duplicate
code and introduce a proper abstraction should we want to support other kind
of mapping for address to slab and cache (eg.  for vmalloc() or I/O memory).

Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg
5295a74cc0 [PATCH] slab: reduce inlining
From: Manfred Spraul <manfred@colorfullife.com>

Reduce the amount of inline functions in slab to the functions that
are used in the hot path:

  - no inline for debug functions
  - no __always_inline, inline is already __always_inline
  - remove inline from a few numa support functions.

Before:

   text    data     bss     dec     hex filename
  13588     752      48   14388    3834 mm/slab.o (defconfig)
  16671    2492      48   19211    4b0b mm/slab.o (numa)

After:

   text    data     bss     dec     hex filename
  13366     752      48   14166    3756 mm/slab.o (defconfig)
  16230    2492      48   18770    4952 mm/slab.o (numa)

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Matthew Dobson
78d382d77c [PATCH] slab: extract slab_{put|get}_obj
Create two helper functions slab_get_obj() and slab_put_obj() to replace
duplicated code in mm/slab.c

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Matthew Dobson
12dd36faec [PATCH] slab: extract slab_destroy_objs()
Create a helper function, slab_destroy_objs() which called from
slab_destroy().  This makes slab_destroy() smaller and more readable, and
moves ifdefs outside the function body.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Steven Rostedt
fbaccacff1 [PATCH] slab: cache_estimate cleanup
Clean up cache_estimate() in mm/slab.c and improves the algorithm from O(n) to
O(1).  We first calculate the maximum number of objects a slab can hold after
struct slab and kmem_bufctl_t for each object has been given enough space.
After that, to respect alignment rules, we decrease the number of objects if
necessary.  As required padding is at most align-1 and memory of obj_size is
at least align, it is always enough to decrease number of objects by one.

The optimization was originally made by Balbir Singh with more improvements
from Steven Rostedt.  Manfred Spraul provider further modifications: no loop
at all for the off-slab case and added comments to explain the background.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Steven Rostedt
5ec8a847bb [PATCH] slab: have index_of bug at compile time
I noticed the code for index_of is a creative way of finding the cache
index using the compiler to optimize to a single hard coded number.  But
I couldn't help noticing that it uses two methods to let you know that
someone used it wrong.  One is at compile time (the correct way), and
the other is at run time (not good).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Christoph Lameter
18f820f655 [PATCH] slab: minor cleanup to kmem_cache_alloc_node
Clean up kmem_cache_alloc_node a bit.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Manfred Spraul
3dafccf227 [PATCH] slab: distinguish between object and buffer size
An object cache has two different object lengths:

  - the amount of memory available for the user (object size)
  - the amount of memory allocated internally (buffer size)

This patch does some renames to make the code reflect that better.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Christoph Lameter
e965f9630c [PATCH] Direct Migration V9: Avoid writeback / page_migrate() method
Migrate a page with buffers without requiring writeback

This introduces a new address space operation migratepage() that may be used
by a filesystem to implement its own version of page migration.

A version is provided that migrates buffers attached to pages.  Some
filesystems (ext2, ext3, xfs) are modified to utilize this feature.

The swapper address space operation are modified so that a regular
migrate_page() will occur for anonymous pages without writeback (migrate_pages
forces every anonymous page to have a swap entry).

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Christoph Lameter
7e2ab150d1 [PATCH] Direct Migration V9: upgrade MPOL_MF_MOVE and sys_migrate_pages()
Modify policy layer to support direct page migration

- Add migrate_pages_to() allowing the migration of a list of pages to a a
  specified node or to vma with a specific allocation policy in sets of
  MIGRATE_CHUNK_SIZE pages

- Modify do_migrate_pages() to do a staged move of pages from the source
  nodes to the target nodes.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
a3351e525e [PATCH] Direct Migration V9: remove_from_swap() to remove swap ptes
Add remove_from_swap

remove_from_swap() allows the restoration of the pte entries that existed
before page migration occurred for anonymous pages by walking the reverse
maps.  This reduces swap use and establishes regular pte's without the need
for page faults.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
a48d07afdf [PATCH] Direct Migration V9: migrate_pages() extension
Add direct migration support with fall back to swap.

Direct migration support on top of the swap based page migration facility.

This allows the direct migration of anonymous pages and the migration of file
backed pages by dropping the associated buffers (requires writeout).

Fall back to swap out if necessary.

The patch is based on lots of patches from the hotplug project but the code
was restructured, documented and simplified as much as possible.

Note that an additional patch that defines the migrate_page() method for
filesystems is necessary in order to avoid writeback for anonymous and file
backed pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
b16664e44c [PATCH] Direct Migration V9: PageSwapCache checks
Check for PageSwapCache after looking up and locking a swap page.

The page migration code may change a swap pte to point to a different page
under lock_page().

If that happens then the vm must retry the lookup operation in the swap space
to find the correct page number.  There are a couple of locations in the VM
where a lock_page() is done on a swap page.  In these locations we need to
check afterwards if the page was migrated.  If the page was migrated then the
old page that was looked up before was freed and no longer has the
PageSwapCache bit set.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
2a16e3f4b0 [PATCH] Reclaim slab during zone reclaim
If large amounts of zone memory are used by empty slabs then zone_reclaim
becomes uneffective.  This patch shakes the slab a bit.

The problem with this patch is that the slab reclaim is not containable to a
zone.  Thus slab reclaim may affect the whole system and be extremely slow.
This also means that we cannot determine how many pages were freed in this
zone.  Thus we need to go off node for at least one allocation.

The functionality is disabled by default.

We could modify the shrinkers to take a zone parameter but that would be quite
invasive.  Better ideas are welcome.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
1b2ffb7896 [PATCH] Zone reclaim: Allow modification of zone reclaim behavior
In some situations one may want zone_reclaim to behave differently.  For
example a process writing large amounts of memory will spew unto other nodes
to cache the writes if many pages in a zone become dirty.  This may impact the
performance of processes running on other nodes.

Allowing writes during reclaim puts a stop to that behavior and throttles the
process by restricting the pages to the local zone.

Similarly one may want to contain processes to local memory by enabling
regular swap behavior during zone_reclaim.  Off node memory allocation can
then be controlled through memory policies and cpusets.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
2a11ff06d7 [PATCH] zone_reclaim: configurable off node allocation period.
Currently the zone_reclaim code has a fixed window of 30 seconds of off node
allocations should a local zone have no unused pagecache pages left.  Reclaim
will be attempted again after this timeout period to avoid repeated useless
scans for memory.  This is also useful to established sufficiently large off
node allocation chunks to relieve the local node.

It may be beneficial to adjust that time period for some special situations.
For example if memory use was exceeding node capacity one may want to give up
for longer periods of time.  If memory spikes intermittendly then one may want
to shorten the time period to reduce the number of off node allocations.

This patch allows just that....

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
a92f71263a [PATCH] zone_reclaim: partial scans instead of full scan
Instead of scanning all the pages in a zone, imitate real swap and scan
only a portion of the pages and gradually scan more if we do not free up
enough pages.  This avoids a zone suddenly loosing all unused pagecache
pages (we may after all access some of these again so they deserve another
chance) but it still frees up large chunks of memory if a zone only
contains unused pagecache pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
aa3f18b339 [PATCH] zone_reclaim: do not unmap file backed pages
zone_reclaim should leave that to the real swapper.  We are only interested
in evicting unmapped pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:15 -08:00
Benjamin LaHaise
9884fd8df1 [PATCH] Use 32 bit division in slab_put_obj()
Improve the performance of slab_put_obj().  Without the cast, gcc considers
ptrdiff_t a 64 bit signed integer and ends up emitting code to use a full
signed 128 bit divide on EM64T, which is substantially slower than a 32 bit
unsigned divide.

I noticed this when looking at the profile of a case where the slab balance
is just on edge and thrashes back and forth freeing a block.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:15 -08:00
Christoph Lameter
c84db23c6e [PATCH] zone_reclaim: minor fixes
- If we only reclaim nr_pages then its okay to stay on node.
  Switch from > to >= for the comparison.

- vm_table[] entry for zone_reclaim_mode is a bit screwed up.

- Add empty lines around shrink_zone to show that this is the
  central function to be called.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:15 -08:00
Christoph Lameter
52a8363eae [PATCH] mm: improve function of sc->may_writepage
Make sc->may_writepage control the writeout behavior of shrink_list.

Remove the laptop_mode trick from shrink_list and instead set may_writepage
in try_to_free_pages properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:15 -08:00
Christoph Lameter
42c722d4cb [PATCH] zone_reclaim: reclaim on memory only node support
Zone reclaim is usually only run on the local node.  Headless nodes do not
have any local processors.  This patch checks for headless nodes and
performs zone reclaim on them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:14 -08:00
Christoph Lameter
8928862398 [PATCH] Optimize off-node performance of zone reclaim
Ensure that the performance of off node pages stays the same as before.
Off node pagefault tests showed an 18% drop in performance without this
patch.

- Increase the timeout to 30 seconds to reduce the overhead.

- Move all code possible out of the off node hot path for zone reclaim
  (Sorry Andrew, the struct initialization had to be sacrificed).
  The read_page_state() bit us there.

- Check first for the timeout before any other checks.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:14 -08:00
Ashok Raj
6292d9aaf3 [PATCH] __cpuinit functions wrongly marked __meminit
__meminit has overzelously been modified and crept its way into marking
cpuup callbacks as __meminit.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:09 -08:00
Christoph Lameter
86c562a9d6 [PATCH] mm: optimize numa policy handling in slab allocator
Move the interrupt check from slab_node into ___cache_alloc and adds an
"unlikely()" to avoid pipeline stalls on some architectures.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:18 -08:00
Christoph Lameter
dc85da15d4 [PATCH] NUMA policies in the slab allocator V2
This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
imbalance in memory allocation during bootup.

The slab allocator in 2.6.13 is not numa aware and simply calls
alloc_pages().  This means that memory policies may control the behavior of
alloc_pages().  During bootup the memory policy is set to MPOL_INTERLEAVE
resulting in the spreading out of allocations during bootup over all
available nodes.  The slab allocator in 2.6.13 has only a single list of
slab pages.  As a result the per cpu slab cache and the spinlock controlled
page lists may contain slab entries from off node memory.  The slab
allocator in 2.6.13 makes no effort to discern the locality of an entry on
its lists.

The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
explicitly by calling alloc_pages_node().  The NUMA slab allocator manages
slab entries by having lists of available slab pages for each node.  The
per cpu slab cache can only contain slab entries associated with the node
local to the processor.  This guarantees that the default allocation mode
of the slab allocator always assigns local memory if available.

Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
anymore.  In 2.6.14 all node unspecific slab allocations are performed on
the boot processor.  This means that most of key data structures are
allocated on one node.  Most processors will have to refer to these
structures making the boot node a potential bottleneck.  This may reduce
performance and cause unnecessary memory pressure on the boot node.

This patch implements NUMA policies in the slab layer.  There is the need
of explicit application of NUMA memory policies by the slab allcator itself
since the NUMA slab allocator does no longer let the page_allocator control
locality.

The check for policies is made directly at the beginning of __cache_alloc
using current->mempolicy.  The memory policy is already frequently checked
by the page allocator (alloc_page_vma() and alloc_page_current()).  So it
is highly likely that the cacheline is present.  For MPOL_INTERLEAVE
kmalloc() will spread out each request to one node after another so that an
equal distribution of allocations can be obtained during bootup.

It is not possible to push the policy check to lower layers of the NUMA
slab allocator since the per cpu caches are now only containing slab
entries from the current node.  If the policy says that the local node is
not to be preferred or forbidden then there is no point in checking the
slab cache or local list of slab pages.  The allocation better be directed
immediately to the lists containing slab entries for the allowed set of
nodes.

This way of applying policy also fixes another strange behavior in 2.6.13.
alloc_pages() is controlled by the memory allocation policy of the current
process.  It could therefore be that one process is running with
MPOL_INTERLEAVE and would f.e.  obtain a new page following that policy
since no slab entries are in the lists anymore.  A page can typically be
used for multiple slab entries but lets say that the current process is
only using one.  The other entries are then added to the slab lists.  These
are now non local entries in the slab lists despite of the possible
availability of local pages that would provide faster access and increase
the performance of the application.

Another process without MPOL_INTERLEAVE may now run and expect a local slab
entry from kmalloc().  However, there are still these free slab entries
from the off node page obtained from the other process via MPOL_INTERLEAVE
in the cache.  The process will then get an off node slab entry although
other slab entries may be available that are local to that process.  This
means that the policy if one process may contaminate the locality of the
slab caches for other processes.

This patch in effect insures that a per process policy is followed for the
allocation of slab entries and that there cannot be a memory policy
influence from one process to another.  A process with default policy will
always get a local slab entry if one is available.  And the process using
memory policies will get its memory arranged as requested.  Off-node slab
allocation will require the use of spinlocks and will make the use of per
cpu caches not possible.  A process using memory policies to redirect
allocations offnode will have to cope with additional lock overhead in
addition to the latency added by the need to access a remote slab entry.

Changes V1->V2
- Remove #ifdef CONFIG_NUMA by moving forward declaration into
  prior #ifdef CONFIG_NUMA section.

- Give the function determining the node number to use a saner
  name.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:18 -08:00
Ingo Molnar
fc0abb1451 [PATCH] sem2mutex: mm/slab.c
Convert mm/swapfile.c's swapon_sem to swapon_mutex.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:18 -08:00
Christoph Lameter
9eeff2395e [PATCH] Zone reclaim: Reclaim logic
Some bits for zone reclaim exists in 2.6.15 but they are not usable.  This
patch fixes them up, removes unused code and makes zone reclaim usable.

Zone reclaim allows the reclaiming of pages from a zone if the number of
free pages falls below the watermarks even if other zones still have enough
pages available.  Zone reclaim is of particular importance for NUMA
machines.  It can be more beneficial to reclaim a page than taking the
performance penalties that come with allocating a page on a remote zone.

Zone reclaim is enabled if the maximum distance to another node is higher
than RECLAIM_DISTANCE, which may be defined by an arch.  By default
RECLAIM_DISTANCE is 20.  20 is the distance to another node in the same
component (enclosure or motherboard) on IA64.  The meaning of the NUMA
distance information seems to vary by arch.

If zone reclaim is not successful then no further reclaim attempts will
occur for a certain time period (ZONE_RECLAIM_INTERVAL).

This patch was discussed before. See

http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Christoph Lameter
f1fd1067ec [PATCH] Zone reclaim: resurrect may_swap
Zone reclaim has a huge impact on NUMA performance (f.e.  our maximum
throughput with XFS is raised from 4GB to 6GB/sec / page cache contamination
of numa nodes destroys locality if one just does a large copy operation which
results in performance dropping for good until reboot).

This patch:

Resurrect may_swap in struct scan_control

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Christoph Lameter
fc30128963 [PATCH] Simplify migrate_page_add
Simplify migrate_page_add after feedback from Hugh.  This also allows us to
drop one parameter from migrate_page_add.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Nick Piggin
053837fce7 [PATCH] mm: migration page refcounting fix
Migration code currently does not take a reference to target page
properly, so between unlocking the pte and trying to take a new
reference to the page with isolate_lru_page, anything could happen to
it.

Fix this by holding the pte lock until we get a chance to elevate the
refcount.

Other small cleanups while we're here.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Andrew Morton
e236a166b2 [PATCH] mm: dirty_exceeded speedup
Ravikiran reports that this variable is bouncing all around nodes on NUMA
machines, causing measurable performance problems.  Fix that up by only
writing to it when it actually changed.

And put it in a new cacheline to prevent it sharing with other things (this
happened).

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Matt Tolentino
c09b42404d [PATCH] x86_64: add __meminit for memory hotplug
Add __meminit to the __init lineup to ensure functions default
to __init when memory hotplug is not enabled.  Replace __devinit
with __meminit on functions that were changed when the memory
hotplug code was introduced.

Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16 23:18:35 -08:00
Paul Jackson
505970b96e [PATCH] cpuset oom lock fix
The problem, reported in:

  http://bugzilla.kernel.org/show_bug.cgi?id=5859

and by various other email messages and lkml posts is that the cpuset hook
in the oom (out of memory) code can try to take a cpuset semaphore while
holding the tasklist_lock (a spinlock).

One must not sleep while holding a spinlock.

The fix seems easy enough - move the cpuset semaphore region outside the
tasklist_lock region.

This required a few lines of mechanism to implement.  The oom code where
the locking needs to be changed does not have access to the cpuset locks,
which are internal to kernel/cpuset.c only.  So I provided a couple more
cpuset interface routines, available to the rest of the kernel, which
simple take and drop the lock needed here (cpusets callback_sem).

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 18:27:10 -08:00
Robin Holt
7339ff8302 [PATCH] Add tmpfs options for memory placement policies
Anything that writes into a tmpfs filesystem is liable to disproportionately
decrease the available memory on a particular node.  Since there's no telling
what sort of application (e.g.  dd/cp/cat) might be dropping large files
there, this lets the admin choose the appropriate default behavior for their
site's situation.

Introduce a tmpfs mount option which allows specifying a memory policy and
a second option to specify the nodelist for that policy.  With the default
policy, tmpfs will behave as it does today.  This patch adds support for
preferred, bind, and interleave policies.

The default policy will cause pages to be added to tmpfs files on the node
which is doing the writing.  Some jobs expect a single process to create
and manage the tmpfs files.  This results in a node which has a
significantly reduced number of free pages.

With this patch, the administrator can specify the policy and nodes for
that policy where they would prefer allocations.

This patch was originally written by Brent Casavant and Hugh Dickins.  I
added support for the bind and preferred policies and the mpol_nodelist
mount option.

Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 18:27:07 -08:00
Linus Torvalds
9f5974c873 Merge git://oss.sgi.com:8090/oss/git/xfs-2.6 2006-01-12 09:10:34 -08:00
Greg Ungerer
cbe8dd4af2 [PATCH] memmap_init_zone(): remove uneccesary page++
Remove unecessary page++ from memmap_init_zone loop.

Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:49 -08:00
Catalin Marinas
2a7e2f7dcb [PATCH] do_truncate() call fix in tiny-shmem.c
Adapt tiny-shmem.c to the new do_truncate() prototype.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:49 -08:00
Christoph Lameter
f4598c8b36 [PATCH] migration: make sure there is no attempt to migrate reserved pages.
This ensures that reserved pages are not migrated.  Reserved pages
currently cause the WARN_ON to trigger in migrate_page_add()

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:48 -08:00
Randy.Dunlap
c59ede7b78 [PATCH] move capable() to capability.h
- Move capable() from sched.h to capability.h;

- Use <linux/capability.h> where capable() is used
	(in include/, block/, ipc/, kernel/, a few drivers/,
	mm/, security/, & sound/;
	many more drivers/ to go)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:13 -08:00
Paul Jackson
4eac915d02 [PATCH] mm: gfp_atomic comments
Clarify in comments that GFP_ATOMIC means both "don't sleep" and "use
emergency pools", hence both ALLOC_HARDER and ALLOC_HIGH.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:09 -08:00
Hugh Dickins
7365f3d169 [PATCH] Restore KERN_EMERG to each line printed by bad_page
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:08 -08:00
Nathan Scott
ddae9c2ea7 Merge HEAD from oss.sgi.com:/oss/git/linux-2.6.git 2006-01-12 13:34:47 +11:00
David Woodhouse
a4fc7ab1d0 [PATCH] fix/simplify mutex debugging code
Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as
arguments instead, since all its callers were just calculating the 'to'
address for themselves anyway... (and sometimes doing so badly).

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 08:14:16 -08:00
Christoph Hellwig
78539fdfa4 [XFS] Export pagevec_lookup for use on the XFS page writeout path,
for dealing with delayed allocate and unwritten extents (as well).

Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
2006-01-11 20:47:41 +11:00
Jesper Juhl
e97a31117c add missing printk loglevel in mm/swapfile.c
in mm/swapfile.c a printk() is missing a loglevel. I believe the proper
loglevel for this situation is KERN_ERR, so that's what the patch below
sets -if you agree, please apply.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-01-11 01:50:28 +01:00
Christoph Hellwig
870f481793 [PATCH] replace inode_update_time with file_update_time
To allow various options to work per-mount instead of per-sb we need a
struct vfsmount when updating ctime and mtime.  This preparation patch
replaces the inode_update_time routine with a file_update_atime routine so
we can easily get at the vfsmount.  (and the file makes more sense in this
context anyway).  Also get rid of the unused second argument - we always
want to update the ctime when calling this routine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:30 -08:00
Jes Sorensen
1b1dcc1b57 [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.

Modified-by: Ingo Molnar <mingo@elte.hu>

(finished the conversion)

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-09 15:59:24 -08:00
Ingo Molnar
de5097c2e7 [PATCH] mutex subsystem, more debugging code
more mutex debugging: check for held locks during memory freeing,
task exit, enable sysrq printouts, etc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09 15:59:21 -08:00
Linus Torvalds
6150c32589 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc-merge 2006-01-09 10:03:44 -08:00
Valentine Barshak
87ba81dba4 [PATCH] fadvise: return ESPIPE on FIFO/pipe
The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be
fully POSIX-compliant.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:00 -08:00
OGAWA Hirofumi
28fd129827 [PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.

See mm/filemap.c:

And changes the filemap_write_and_wait() and filemap_write_and_wait_range().

Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error.  However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)

<quotation>
Andrew Morton writes,

If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc.  Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>

So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.

Trond, could you please review the nfs part?  Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.

Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:47 -08:00
OGAWA Hirofumi
268fc16e34 [PATCH] export/change sync_page_range/_nolock()
This exports/changes the sync_page_range/_nolock().  The fatfs needs
sync_page_range/_nolock() for expanding truncate, and changes "size_t count"
to "loff_t count".

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:47 -08:00
Paul Jackson
4225399a66 [PATCH] cpuset: rebind vma mempolicies fix
Fix more of longstanding bug in cpuset/mempolicy interaction.

NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
to just the Memory Nodes allowed by that cpuset.  The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

When a tasks cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
tasks mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.

An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.

This fix rebinds mempolicies attached to vma's (address ranges in a tasks
address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired.  The tasks mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpusets mems_generation.

Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan.  In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock).  So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.

Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a tasks mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.

When a task is moved to a different cpuset, it is easier, as there is only one
task involved.  It's mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.

It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice.  This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
74cb21553f [PATCH] cpuset: numa_policy_rebind cleanup
Cleanup, reorganize and make more robust the mempolicy.c code to rebind
mempolicies relative to the containing cpuset after a tasks memory placement
changes.

The real motivator for this cleanup patch is to lay more groundwork for the
upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
after the containing cpuset memory placement changes.

NUMA mempolicies are constrained by the cpuset their task is a member of.
When either (1) a task is moved to a different cpuset, or (2) the 'mems'
mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
be recalculated, relative to their new cpuset placement.

The old code used an unreliable method of determining what was the old
mems_allowed constraining the mempolicy.  It just looked at the tasks
mems_allowed value.  This sort of worked with the present code, that just
rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
referring to the old nodes.  But in an upcoming patch, the vma mempolicies
will be rebound as well.  Then the order in which the various task and vma
mempolicies are updated will no longer be deterministic, and one can no longer
count on the task->mems_allowed holding the old value for as long as needed.
It's not even clear if the current code was guaranteed to work reliably for
task mempolicies.

So I added a mems_allowed field to each mempolicy, stating exactly what
mems_allowed the policy is relative to, and updated synchronously and reliably
anytime that the mempolicy is rebound.

Also removed a useless wrapper routine, numa_policy_rebind(), and had its
caller, cpuset_update_task_memory_state(), call directly to the rewritten
policy_rebind() routine, and made that rebind routine extern instead of
static, and added a "mpol_" prefix to its name, making it
mpol_rebind_policy().

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
909d75a3b7 [PATCH] cpuset: implement cpuset_mems_allowed
Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
needed, to obtain the mems_allowed vector of a cpuset, and replaced the
workaround in sys_migrate_pages() to call this new method.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
cf2a473c40 [PATCH] cpuset: combine refresh_mems and update_mems
The important code paths through alloc_pages_current() and alloc_page_vma(),
by which most kernel page allocations go, both called
cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
-Both- of these latter two routines did a tasklock, got the tasks cpuset
pointer, and checked for out of date cpuset->mems_generation.

That was a silly duplication of code and waste of CPU cycles on an important
code path.

Consolidated those two routines into a single routine, called
cpuset_update_task_memory_state(), since it updates more than just
mems_allowed.

Changed all callers of either routine to call the new consolidated routine.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson
3e0d98b9f1 [PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.

This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.

This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure.  It's up to the batch
manager or other user code to decide what to do about it and take action.

==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:42 -08:00
Paul Jackson
5966514db6 [PATCH] cpuset: mempolicy one more nodemask conversion
Finish converting mm/mempolicy.c from bitmaps to nodemasks.  The previous
conversion had left one routine using bitmaps, since it involved a
corresponding change to kernel/cpuset.c

Fix that interface by replacing with a simple macro that calls nodes_subset(),
or if !CONFIG_CPUSET, returns (1).

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:42 -08:00
Matt Mackall
10cef60295 [PATCH] slob: introduce the SLOB allocator
configurable replacement for slab allocator

This adds a CONFIG_SLAB option under CONFIG_EMBEDDED.  When CONFIG_SLAB is
disabled, the kernel falls back to using the 'SLOB' allocator.

SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
similar to the original Linux kmalloc allocator that SLAB replaced.  It's
signicantly smaller code and is more memory efficient.  But like all
similar allocators, it scales poorly and suffers from fragmentation more
than SLAB, so it's only appropriate for small systems.

It's been tested extensively in the Linux-tiny tree.  I've also
stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
recommended).

Here's a comparison for otherwise identical builds, showing SLOB saving
nearly half a megabyte of RAM:

$ size vmlinux*
   text    data     bss     dec     hex filename
3336372  529360  190812 4056544  3de5e0 vmlinux-slab
3323208  527948  190684 4041840  3dac70 vmlinux-slob

$ size mm/{slab,slob}.o
   text    data     bss     dec     hex filename
  13221     752      48   14021    36c5 mm/slab.o
   1896      52       8    1956     7a4 mm/slob.o

/proc/meminfo:
                  SLAB          SLOB      delta
MemTotal:        27964 kB      27980 kB     +16 kB
MemFree:         24596 kB      25092 kB    +496 kB
Buffers:            36 kB         36 kB       0 kB
Cached:           1188 kB       1188 kB       0 kB
SwapCached:          0 kB          0 kB       0 kB
Active:            608 kB        600 kB      -8 kB
Inactive:          808 kB        812 kB      +4 kB
HighTotal:           0 kB          0 kB       0 kB
HighFree:            0 kB          0 kB       0 kB
LowTotal:        27964 kB      27980 kB     +16 kB
LowFree:         24596 kB      25092 kB    +496 kB
SwapTotal:           0 kB          0 kB       0 kB
SwapFree:            0 kB          0 kB       0 kB
Dirty:               4 kB         12 kB      +8 kB
Writeback:           0 kB          0 kB       0 kB
Mapped:            560 kB        556 kB      -4 kB
Slab:             1756 kB          0 kB   -1756 kB
CommitLimit:     13980 kB      13988 kB      +8 kB
Committed_AS:     4208 kB       4208 kB       0 kB
PageTables:         28 kB         28 kB       0 kB
VmallocTotal:  1007312 kB    1007312 kB       0 kB
VmallocUsed:        48 kB         48 kB       0 kB
VmallocChunk:  1007264 kB    1007264 kB       0 kB

(this work has been sponsored in part by CELF)

From: Ingo Molnar <mingo@elte.hu>

   Fix 32-bitness bugs in mm/slob.c.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:41 -08:00
Matt Mackall
30992c97ae [PATCH] slob: introduce mm/util.c for shared functions
Add mm/util.c for functions common between SLAB and SLOB.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:41 -08:00
Ravikiran G Thirumalai
22fc6eccbf [PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp macros
____cacheline_maxaligned_in_smp is currently used to align critical structures
and avoid false sharing.  It uses per-arch L1_CACHE_SHIFT_MAX and people find
L1_CACHE_SHIFT_MAX useless.

However, we have been using ____cacheline_maxaligned_in_smp to align
structures on the internode cacheline size.  As per Andi's suggestion,
following patch kills ____cacheline_maxaligned_in_smp and introduces
INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
Arches needing L3/Internode cacheline alignment can define
INTERNODE_CACHE_SHIFT in the arch asm/cache.h.  Patch replaces
____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp

With this patch, L1_CACHE_SHIFT_MAX can be killed

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:38 -08:00
Kirill Korotaev
2f659f462d [PATCH] Optimise oom kill of current task
When oom_killer kills current there's no need to call
schedule_timeout_interruptible() since task must die ASAP.

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:45 -08:00
Christoph Lameter
6ce3c4c0ff [PATCH] Move page migration related functions near do_migrate_pages()
Group page migration functions in mempolicy.c

Add a forward declaration for migrate_page_add (like gather_stats()) and use
our new found mobility to group all page migration related function around
do_migrate_pages().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:44 -08:00