16614 Commits

Author SHA1 Message Date
Mel Gorman
835c134ec4 Add a bitmap that is used to track flags affecting a block of pages
Here is the latest revision of the anti-fragmentation patches.  Of particular
note in this version is special treatment of high-order atomic allocations.
Care is taken to group them together and avoid grouping pages of other types
near them.  Artificial tests imply that it works.  I'm trying to get the
hardware together that would allow setting up of a "real" test.  If anyone
already has a setup and test that can trigger the atomic-allocation problem,
I'd appreciate a test of these patches and a report.  The second major change
is that these patches will apply cleanly with patches that implement
anti-fragmentation through zones.

kernbench shows effectively no performance difference varying between -0.2%
and +2% on a variety of test machines.  Success rates for huge page allocation
are dramatically increased.  For example, on a ppc64 machine, the vanilla
kernel was only able to allocate 1% of memory as a hugepage and this was due
to a single hugepage reserved as min_free_kbytes.  With these patches applied,
17% was allocatable as superpages.  With reclaim-related fixes from Andy
Whitcroft, it was 40% and further reclaim-related improvements should increase
this further.

Changelog Since V28
o Group high-order atomic allocations together
o It is no longer required to set min_free_kbytes to 10% of memory. A value
  of 16384 in most cases will be sufficient
o Now applied with zone-based anti-fragmentation
o Fix incorrect VM_BUG_ON within buffered_rmqueue()
o Reorder the stack so later patches do not back out work from earlier patches
o Fix bug where journal pages were being treated as movable
o Bias placement of non-movable pages to lower PFNs
o More aggressive clustering of reclaimable pages in reaction to workloads
  like updatedb that flood the size of inode caches

Changelog Since V27

o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation was giving
  the mistaken impression that it was the 100% solution for high order
  allocations. Instead, it greatly increases the chances high-order
  allocations will succeed and lays the foundation for defragmentation and
  memory hot-remove to work properly
o Redefine page groupings based on ability to migrate or reclaim instead of
  basing on reclaimability alone
o Get rid of spurious inits
o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
  searched for a page of the appropriate type
o Added more explanation commentary
o Fix up bug in pageblock code where bitmap was used before being initialised

Changelog Since V26
o Fix double init of lists in setup_pageset

Changelog Since V25
o Fix loop order of for_each_rclmtype_order so that order of loop matches args
o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
o Rename get_pageblock_type() to get_page_rclmtype()
o Fix alignment problem in move_freepages()
o Add mechanism for assigning flags to blocks of pages instead of page->flags
o On fallback, do not examine the preferred list of free pages a second time

The purpose of these patches is to reduce external fragmentation by grouping
pages of related types together.  When pages are migrated (or reclaimed under
memory pressure), large contiguous pages will be freed.

This patch works by categorising allocations by their ability to migrate:

Movable - The pages may be moved with the page migration mechanism. These are
	generally userspace pages.

Reclaimable - These are allocations for some kernel caches that are
	reclaimable or allocations that are known to be very short-lived.

Unmovable - These are pages that are allocated by the kernel that
	are not trivially reclaimed. For example, the memory allocated for a
	loaded module would be in this category. By default, allocations are
	considered to be of this type

HighAtomic - These are high-order allocations belonging to callers that
	cannot sleep or perform any IO. In practice, this is restricted to
	jumbo frame allocation for network receive. It is assumed that the
	allocations are short-lived

Instead of having one MAX_ORDER-sized array of free lists in struct free_area,
there is one for each type of reclaimability.  Once a 2^MAX_ORDER block of
pages is split for a type of allocation, it is added to the free-lists for
that type, in effect reserving it.  Hence, over time, pages of the different
types can be clustered together.

When the preferred freelists are exhausted, the largest possible block is taken
from an alternative list.  Buddies that are split from that large block are
placed on the preferred allocation-type freelists to mitigate fragmentation.
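
The per-type free lists and the fallback rule can be sketched in userspace C
roughly as follows.  This is only an illustration under stated assumptions: it
tracks counts rather than real page lists, and every name in it (sk_free_area,
sk_alloc and so on) is hypothetical and not taken from the patches.

```c
#include <assert.h>

/* Hypothetical allocation types mirroring the categories in the text. */
enum sk_type { SK_UNMOVABLE, SK_RECLAIMABLE, SK_MOVABLE, SK_HIGHATOMIC, SK_TYPES };

#define SK_ORDERS 11	/* assumed number of orders, for illustration */

/* One free-block counter per type at each order (stand-in for list heads). */
struct sk_free_area {
	unsigned long nr_free[SK_TYPES];
};

/*
 * Take one block of the given order for the given type.
 * Returns the type whose list supplied the block, or -1 if none could.
 */
static int sk_alloc(struct sk_free_area area[SK_ORDERS], enum sk_type type, int order)
{
	int o, t, from = -1, found = -1;

	assert(order >= 0 && order < SK_ORDERS);

	/* 1. Preferred lists: the smallest block that fits. */
	for (o = order; o < SK_ORDERS && found < 0; o++)
		if (area[o].nr_free[type]) {
			from = type;
			found = o;
		}

	/* 2. Fallback: the largest block available on any other type's lists. */
	for (o = SK_ORDERS - 1; o >= order && found < 0; o--)
		for (t = 0; t < SK_TYPES && found < 0; t++)
			if (t != (int)type && area[o].nr_free[t]) {
				from = t;
				found = o;
			}

	if (found < 0)
		return -1;

	area[found].nr_free[from]--;

	/* Split down; every spare buddy goes onto the preferred type's lists. */
	for (o = found - 1; o >= order; o--)
		area[o].nr_free[type]++;

	return from;
}
```

In this sketch, stealing one large movable block for an unmovable order-0
request leaves all the split-off buddies on the unmovable lists, so later
unmovable requests are served from the same block instead of fragmenting
another one.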

This implementation gives best-effort for low fragmentation in all zones.
Ideally, min_free_kbytes needs to be set to a value equal to 4 * (1 <<
(MAX_ORDER-1)) pages in most cases.  This would be 16384 on x86 and x86_64 for
example.
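
As a worked example of that arithmetic, here is a small sketch that assumes
MAX_ORDER is 11 and pages are 4 KiB (both are assumptions for illustration;
the helper name is hypothetical).

```c
#include <assert.h>

/*
 * Suggested min_free_kbytes from the text: 4 * (1 << (MAX_ORDER-1)) pages,
 * converted to KiB.  page_kbytes is the page size in KiB.
 */
static unsigned long sk_min_free_kbytes(unsigned int max_order,
					unsigned long page_kbytes)
{
	unsigned long pages;

	assert(max_order >= 1 && max_order < 8 * sizeof(unsigned long));
	pages = 4UL * (1UL << (max_order - 1));
	return pages * page_kbytes;
}
```

With max_order 11 and 4 KiB pages this is 4 * 1024 pages * 4 KiB = 16384,
which matches the value quoted for x86 and x86_64.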

Our tests show that about 60-70% of physical memory can be allocated on a
desktop after a few days uptime.  In benchmarks and stress tests, we are
finding that 80% of memory is available as contiguous blocks at the end of the
test.  To compare, a standard kernel was getting < 1% of memory as large pages
on a desktop and about 8-12% of memory as large pages at the end of stress
tests.

Following this email are 12 patches that implement the page grouping feature.
The first patch introduces a mechanism for storing flags related to a whole
block of pages.  Then allocations are split between movable and all other
allocations.  Following that are patches to deal with per-cpu pages and make
the mechanism configurable.  The next patch moves free pages between lists
when partially allocated blocks are used for pages of another migrate type.
The second last patch groups reclaimable kernel allocations such as inode
caches together.  The final patch related to groupings keeps high-order atomic
allocations together.

The last two patches are more concerned with control of fragmentation.  The
second last patch biases placement of non-movable allocations towards the
start of memory.  This is with a view of supporting memory hot-remove of DIMMs
with higher PFNs in the future.  The biasing could be enforced a lot heavier
but it would cost.  The last patch aggressively clusters reclaimable pages like
inode caches together.

The fragmentation reduction strategy needs to track if pages within a block
can be moved or reclaimed so that pages are freed to the appropriate list.
This patch adds a bitmap for flags affecting a whole MAX_ORDER block of
pages.

In non-SPARSEMEM configurations, the bitmap is stored in the struct zone and
allocated during initialisation.  SPARSEMEM statically allocates the bitmap in
a struct mem_section so that bitmaps do not have to be resized during memory
hotadd.  This wastes a small amount of memory per unused section (usually
sizeof(unsigned long)) but the complexity of dynamically allocating the memory
is quite high.
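
A minimal userspace sketch of such a per-block bitmap is shown below.  It
assumes a block of 1 << 10 pages and two flag bits per block; both values and
all sk_* names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <limits.h>

#define SK_BLOCK_ORDER		10	/* assumed: pages per block is 1 << 10 */
#define SK_BITS_PER_BLOCK	2	/* assumed number of flag bits per block */
#define SK_LONG_BITS		(sizeof(unsigned long) * CHAR_BIT)

/* Position in the bitmap of flag 'bit' for the block containing 'pfn'. */
static unsigned long sk_bit_index(unsigned long pfn, unsigned int bit)
{
	assert(bit < SK_BITS_PER_BLOCK);
	return (pfn >> SK_BLOCK_ORDER) * SK_BITS_PER_BLOCK + bit;
}

/* Set a flag for the whole block that 'pfn' belongs to. */
static void sk_set_block_flag(unsigned long *bitmap, unsigned long pfn,
			      unsigned int bit)
{
	unsigned long idx = sk_bit_index(pfn, bit);

	bitmap[idx / SK_LONG_BITS] |= 1UL << (idx % SK_LONG_BITS);
}

/* Test a flag; every pfn in the same block sees the same answer. */
static int sk_test_block_flag(const unsigned long *bitmap, unsigned long pfn,
			      unsigned int bit)
{
	unsigned long idx = sk_bit_index(pfn, bit);

	return (int)((bitmap[idx / SK_LONG_BITS] >> (idx % SK_LONG_BITS)) & 1UL);
}
```

Because the index is derived from pfn >> SK_BLOCK_ORDER, the flags describe
the block rather than any single page, which is the property the text relies
on instead of using page->flags.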

Additional credit to Andy Whitcroft who reviewed an earlier implementation
of the mechanism and suggested how to make it a *lot* cleaner.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
KAMEZAWA Hiroyuki
954ffcb35f flush icache before set_pte() on ia64: flush icache at set_pte
Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
set_pte().  This is too late.  This patch removes lazy_mmu_prot_update and
adds a modified set_pte() that flushes if necessary.

This patch flushes the icache of a page when
	new pte has exec bit.
	&& new pte has present bit
	&& new pte is user's page.
	&& (old *ptep is not present
            || new pte's pfn is not the same as old *ptep's pfn)
	&& new pte's page has no PG_arch_1 bit.
	   PG_arch_1 is set when a page is cache consistent.
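
The condition above can be written as a single predicate.  The sketch below
models a pte as a plain struct; the struct, its fields and the function name
are hypothetical and only illustrate the logic, not the ia64 implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the pte bits the condition looks at. */
struct sk_pte {
	bool present;
	bool exec;
	bool user;
	unsigned long pfn;
};

/*
 * True when the icache should be flushed before installing new_pte.
 * page_is_consistent stands in for the arch_1 page flag being set.
 */
static bool sk_needs_icache_flush(const struct sk_pte *old_pte,
				  const struct sk_pte *new_pte,
				  bool page_is_consistent)
{
	assert(old_pte && new_pte);
	return new_pte->exec &&
	       new_pte->present &&
	       new_pte->user &&
	       (!old_pte->present || new_pte->pfn != old_pte->pfn) &&
	       !page_is_consistent;
}
```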

I think these condition checks are much easier to understand than considering
"Where sync_icache_dcache() should be inserted ?".

pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
clean-up. So, I added it again.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
6cb062296f Categorize GFP flags
The function of GFP_LEVEL_MASK seems to be unclear.  In order to clear up
the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
flags:

GFP_RECLAIM_MASK	Flags used to control page allocator reclaim behavior.

GFP_CONSTRAINT_MASK	Flags used to limit where allocations can occur.

GFP_SLAB_BUG_MASK	Flags that the slab allocator BUG()s on.

These replace the uses of GFP_LEVEL mask in the slab allocators and in
vmalloc.c.

The use of the flags not included in these sets may occur as a result of a
slab allocation standing in for a page allocation when constructing scatter
gather lists.  Extraneous flags are cleared and not passed through to the
page allocator.  __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
now be ignored if passed to a slab allocator.
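
The masking can be illustrated with a small sketch.  The bit values and the
membership of each set below are assumptions made up for this example (only
the placement of __GFP_THISNODE and __GFP_HARDWALL in the constraint set comes
from this message), and the sk_* names are hypothetical.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_GFP_WAIT	0x001u
#define SK_GFP_IO	0x002u
#define SK_GFP_HARDWALL	0x004u
#define SK_GFP_THISNODE	0x008u
#define SK_GFP_COLD	0x010u
#define SK_GFP_COMP	0x020u
#define SK_GFP_HIGHMEM	0x040u

/* Assumed membership of the three sets. */
#define SK_RECLAIM_MASK		(SK_GFP_WAIT | SK_GFP_IO)
#define SK_CONSTRAINT_MASK	(SK_GFP_HARDWALL | SK_GFP_THISNODE)
#define SK_SLAB_BUG_MASK	(SK_GFP_HIGHMEM)

/* Flags a slab allocator would hand on to the page allocator. */
static unsigned int sk_slab_to_page_flags(unsigned int flags)
{
	assert((flags & SK_SLAB_BUG_MASK) == 0);	/* stands in for BUG() */
	return flags & (SK_RECLAIM_MASK | SK_CONSTRAINT_MASK);
}

/* Flags for allocator metadata: placement constraints are dropped. */
static unsigned int sk_metadata_flags(unsigned int flags)
{
	return sk_slab_to_page_flags(flags) & ~SK_CONSTRAINT_MASK;
}
```

In this sketch a cold or compound hint passed to the slab allocator is simply
masked off, while metadata allocations additionally lose their node and
hardwall constraints so that fallback remains possible.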

Change the allocation of allocator meta data in SLAB and vmalloc to not
pass through flags listed in GFP_CONSTRAINT_MASK.  SLAB already removes the
__GFP_THISNODE flag for such allocations.  Generalize that to also cover
vmalloc.  The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.

The impact of allocator metadata placement on access latency to the
cachelines of the object itself is minimal since metadata is only
referenced on alloc and free.  The attempt is still made to place the meta
data optimally but we consistently allow fallback both in SLAB and vmalloc
(SLUB does not need to allocate metadata like that).

Allocator metadata may serve multiple in kernel users and thus should not
be subject to the limitations arising from a single allocation context.

[akpm@linux-foundation.org: fix fallback_alloc()]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
0e1e7c7a73 Memoryless nodes: Use N_HIGH_MEMORY for cpusets
cpusets try to ensure that any node added to a cpuset's mems_allowed is
on-line and contains memory.  The assumption was that online nodes contained
memory.  Thus, it is possible to add memoryless nodes to a cpuset and then add
tasks to this cpuset.  This results in a continuous series of oom-kills and an
apparent system hang.

Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a.  node_memory_map] in
place of node_online_map when vetting memories.  Return error if admin
attempts to write a non-empty mems_allowed node mask containing only
memoryless-nodes.
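
A minimal sketch of that vetting step, using one bit per node in an unsigned
long.  The function name, the boolean return convention and the choice to
filter memoryless nodes out of an accepted mask are assumptions for
illustration, not the cpuset code.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long sk_nodemask;	/* one bit per node; assumes few nodes */

/*
 * Vet a requested mems_allowed mask against the nodes that have memory.
 * A non-empty request that names only memoryless nodes is rejected.
 */
static bool sk_vet_mems_allowed(sk_nodemask requested,
				sk_nodemask nodes_with_memory,
				sk_nodemask *result)
{
	assert(result);
	if (requested != 0 && (requested & nodes_with_memory) == 0)
		return false;
	*result = requested & nodes_with_memory;
	return true;
}
```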

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
523b945855 Memoryless nodes: Fix GFP_THISNODE behavior
GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
first zone of a nodelist.  That only works if the node has memory.  A
memoryless node will have its first zone on another pgdat (node).

GFP_THISNODE will currently simply return memory on the first pgdat.  Thus it
is returning memory on other nodes.  GFP_THISNODE should fail if there is no
local memory on a node.

Add a new set of zonelists for each node that only contain the zones that
belong to the node itself so that no fallback is possible.

Then modify gfp_type to pick up the right zone based on the presence of
__GFP_THISNODE.

Drop the existing GFP_THISNODE checks from the page allocator's hot path.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
37c0708dbe Memoryless nodes: Add N_CPU node state
We need the check for a node with cpu in zone reclaim.  Zone reclaim will not
allow remote zone reclaim if a node has a cpu.

[Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Lameter
7ea1530ab3 Memoryless nodes: introduce mask of nodes with memory
It is necessary to know if nodes have memory since we have recently begun to
add support for memoryless nodes.  For that purpose we introduce two new
node states: N_HIGH_MEMORY and N_NORMAL_MEMORY.

A node has its bit in N_HIGH_MEMORY set if it has any memory regardless of the
type of memory.  If a node has memory then it has at least one zone defined
in its pgdat structure that is located in the pgdat itself.

A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than
ZONE_HIGHMEM.  This means it is possible to allocate memory that is not
subject to kmap.

N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to ensure
that we do the right thing when we encounter a memoryless node.

[akpm@linux-foundation.org: build fix]
[Lee.Schermerhorn@hp.com: update N_HIGH_MEMORY node state for memory hotadd]
[y-goto@jp.fujitsu.com: Fix memory hotplug + sparsemem build]
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Lameter
1380891071 Memoryless nodes: Generic management of nodemasks for various purposes
Why do we need to support memoryless nodes?

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> For fujitsu, problem is called "empty" node.
>
> When ACPI's SRAT table includes "possible nodes", ia64 bootstrap(acpi_numa_init)
> creates nodes, which includes no memory, no cpu.
>
> I tried to remove empty-node in past, but that was denied.
> It was because we can hot-add cpu to the empty node.
> (node-hotplug triggered by cpu is not implemented now. and it will be ugly.)
>
>
> For HP, (Lee can comment on this later), they have memory-less-node.
> As far as I hear, HP's machine can have following configration.
>
> (example)
> Node0: CPU0   memory AAA MB
> Node1: CPU1   memory AAA MB
> Node2: CPU2   memory AAA MB
> Node3: CPU3   memory AAA MB
> Node4: Memory XXX GB
>
> AAA is very small value (below 16MB)  and will be omitted by ia64 bootstrap.
> After boot, only Node 4 has valid memory (but have no cpu.)
>
> Maybe this is memory-interleave by firmware config.

Christoph Lameter <clameter@sgi.com> wrote:

> Future SGI platforms (actually also current one can have but nothing like
> that is deployed to my knowledge) have nodes with only cpus. Current SGI
> platforms have nodes with just I/O that we so far cannot manage in the
> core. So the arch code maps them to the nearest memory node.

Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> For the HP platforms, we can configure each cell with from 0% to 100%
> "cell local memory".  When we configure with <100% CLM, the "missing
> percentages" are interleaved by hardware on a cache-line granularity to
> improve bandwidth at the expense of latency for numa-challenged
> applications [and OSes, but not our problem ;-)].  When we boot Linux on
> such a config, all of the real nodes have no memory--it all resides in a
> single interleaved pseudo-node.
>
> When we boot Linux on a 100% CLM configuration [== NUMA], we still have
> the interleaved pseudo-node.  It contains a few hundred MB stolen from
> the real nodes to contain the DMA zone.  [Interleaved memory resides at
> phys addr 0].  The memoryless-nodes patches, along with the zoneorder
> patches, support this config as well.
>
> Also, when we boot a NUMA config with the "mem=" command line,
> specifying less memory than actually exists, Linux takes the excluded
> memory "off the top" rather than distributing it across the nodes.  This
> can result in memoryless nodes, as well.
>

This patch:

Preparation for memoryless node patches.

Provide a generic way to keep nodemasks describing various characteristics of
NUMA nodes.

Remove the node_online_map and the node_possible_map and realize the same
functionality using two node states: N_POSSIBLE and N_ONLINE.
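
The idea of one nodemask per node state can be sketched as an array of
bitmasks indexed by state.  This is a userspace illustration limited to 64
nodes; the sk_* names are hypothetical, and the memory and CPU states listed
are the ones added by the other patches in this series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state indices; one bitmask of nodes is kept per state. */
enum sk_node_state {
	SK_N_POSSIBLE,
	SK_N_ONLINE,
	SK_N_NORMAL_MEMORY,
	SK_N_HIGH_MEMORY,
	SK_N_CPU,
	SK_NR_NODE_STATES
};

#define SK_MAX_NODES 64

static unsigned long long sk_node_states[SK_NR_NODE_STATES];

static void sk_node_set_state(int node, enum sk_node_state state)
{
	assert(node >= 0 && node < SK_MAX_NODES);
	sk_node_states[state] |= 1ULL << node;
}

static void sk_node_clear_state(int node, enum sk_node_state state)
{
	assert(node >= 0 && node < SK_MAX_NODES);
	sk_node_states[state] &= ~(1ULL << node);
}

static bool sk_node_state(int node, enum sk_node_state state)
{
	assert(node >= 0 && node < SK_MAX_NODES);
	return (sk_node_states[state] >> node) & 1ULL;
}
```

With this layout, "is node 4 online?" and "does node 4 have memory?" are the
same kind of query against different masks, so adding a new characteristic
only needs a new enum entry.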

[Lee.Schermerhorn@hp.com: Initialize N_*_MEMORY and N_CPU masks for non-NUMA config]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Nick Piggin
55144768e1 fs: remove some AOP_TRUNCATED_PAGE
prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and
GFS2 were converted to the new aops, so we can make some simplifications
for that.

[michal.k.k.piotrowski@gmail.com: fix warning]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Nick Piggin
03158cd7eb fs: restore nobh
Implement nobh in new aops.  This is a bit tricky.  FWIW, nobh_truncate is
now implemented in a way that does not create blocks in sparse regions,
which is a silly thing for it to have been doing (isn't it?)

ext2 survives fsx and fsstress. jfs is converted as well... ext3
should be easy to do (but not done yet).

[akpm@linux-foundation.org: coding-style fixes]
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Nick Piggin
a20fa20c54 With reiserfs no longer using the weird generic_cont_expand, remove it completely.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:56 -07:00
Nick Piggin
89e107877b fs: new cont helpers
Rework the generic block "cont" routines to handle the new aops.  Supporting
cont_prepare_write would take quite a lot of code, so remove it
instead (and we later convert all filesystems to use it).

write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
generic_cont_expand, so filesystems can avoid the old hacks they used.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:55 -07:00
Nick Piggin
afddba49d1 fs: introduce write_begin, write_end, and perform_write aops
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:55 -07:00
Nick Piggin
2f718ffc16 mm: buffered write iterator
Add an iterator data structure to operate over an iovec.  Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.
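
For readers unfamiliar with the idea, an iterator over an iovec can be
sketched in userspace as below.  It uses the POSIX struct iovec from
<sys/uio.h> and plain memcpy in place of a usercopy; the struct layout and
function name are hypothetical and not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical iterator state: current segment, offset into it, bytes left. */
struct sk_iov_iter {
	const struct iovec *iov;
	unsigned long nr_segs;
	size_t iov_offset;	/* bytes already consumed in iov[0] */
	size_t count;		/* total bytes remaining */
};

/* Copy up to 'bytes' from the iterator into dst, advancing the iterator. */
static size_t sk_iter_copy(struct sk_iov_iter *i, char *dst, size_t bytes)
{
	size_t copied = 0;

	assert(i && dst);
	if (bytes > i->count)
		bytes = i->count;

	while (copied < bytes) {
		size_t avail = i->iov->iov_len - i->iov_offset;
		size_t n = bytes - copied < avail ? bytes - copied : avail;

		memcpy(dst + copied,
		       (const char *)i->iov->iov_base + i->iov_offset, n);
		copied += n;
		i->iov_offset += n;
		i->count -= n;

		/* Segment used up: step to the next one. */
		if (i->iov_offset == i->iov->iov_len) {
			i->iov++;
			i->nr_segs--;
			i->iov_offset = 0;
		}
	}
	return copied;
}
```

Keeping the segment pointer, the offset and the remaining count together means
a caller can copy a page-sized piece at a time without redoing the segment
arithmetic on every step.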

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:55 -07:00
Nick Piggin
08291429cf mm: fix pagecache write deadlocks
Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock if page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Lee Schermerhorn
754af6f5a8 Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
Allow an application to query the memories allowed by its context.

Updated numa_memory_policy.txt to mention that applications can use this to
obtain allowed memories for constructing valid policies.

TODO:  update out-of-tree libnuma wrapper[s], or maybe add a new
wrapper--e.g.,  numa_get_mems_allowed() ?

Also, update numa syscall man pages.

Tested with memtoy V>=0.13.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Martin Schwidefsky
c92ff1bde0 move mm_struct and vm_area_struct
Move the definitions of struct mm_struct and struct vm_area_struct to
include/linux/mm_types.h.  This allows more functions in asm/pgtable.h
and friends to be defined with inline assemblies instead of macros.  Compile tested on
i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.

[aurelien@aurel32.net: build fix]
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Nick Piggin
c0bc9875b7 radix-tree: use indirect bit
Rather than sign direct radix-tree pointers with a special bit, sign the
indirect one that hangs off the root.  This means that, given a lookup_slot
operation, the invalid result will be differentiated from the valid
(previously, valid results could have the bit either set or clear).

This does not affect slot lookups which occur under lock -- they can never
return an invalid result.  This is needed in future for the lockless pagecache.
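
Tagging a pointer with a low bit, as described above, can be sketched like
this.  It assumes nodes are at least 2-byte aligned so that bit 0 is free;
the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_INDIRECT_BIT ((uintptr_t)1)

/* Mark a pointer to an internal node as indirect by setting bit 0. */
static void *sk_make_indirect(void *node)
{
	assert(((uintptr_t)node & SK_INDIRECT_BIT) == 0);
	return (void *)((uintptr_t)node | SK_INDIRECT_BIT);
}

/* True if the pointer carries the indirect tag. */
static bool sk_is_indirect(const void *ptr)
{
	return ((uintptr_t)ptr & SK_INDIRECT_BIT) != 0;
}

/* Recover the real node address from a tagged pointer. */
static void *sk_strip_indirect(void *ptr)
{
	return (void *)((uintptr_t)ptr & ~SK_INDIRECT_BIT);
}
```

With only the indirect pointer tagged, an item pointer stored in a slot always
has the bit clear, so a lookup that reads a tagged value knows it did not get
a valid item.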

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Nick Piggin
557ed1fa26 remove ZERO_PAGE
The commit b5810039a5 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- measuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
much use to them except on benchmarks.  All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Christoph Lameter
aadb4bc4a1 SLUB: direct pass through of page size or higher kmalloc requests
This gets rid of all kmalloc caches larger than page size.  A kmalloc
request larger than PAGE_SIZE / 2 is going to be passed through to the page
allocator.  This works both inline where we will call __get_free_pages
instead of kmem_cache_alloc and in __kmalloc.

kfree is modified to check if the object is in a slab page. If not then
the page is freed via the page allocator instead. Roughly similar to what
SLOB does.
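
The size check can be sketched as follows, assuming 4 KiB pages and a
pass-through threshold of half a page (both are assumptions of this sketch,
as are the sk_* names).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096u	/* assumed page size */

/* Requests above half a page bypass the slab caches in this sketch. */
static bool sk_goes_to_page_allocator(size_t size)
{
	return size > SK_PAGE_SIZE / 2;
}

/* Smallest order such that (SK_PAGE_SIZE << order) >= size. */
static unsigned int sk_get_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while (((size_t)SK_PAGE_SIZE << order) < size)
		order++;
	return order;
}
```

A 3000-byte request would therefore take a whole order-0 page, and a 16 KiB
request an order-2 block, without touching a kmalloc cache.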

Advantages:
- Reduces memory overhead for kmalloc array
- Large kmalloc operations are faster since they do not
  need to pass through the slab allocator to get to the
  page allocator.
- Performance increase of 10%-20% on alloc and 50% on free for
  PAGE_SIZEd allocations.
  SLUB must call page allocator for each alloc anyways since
  the higher order pages that allowed avoiding the page alloc calls
  are not available in a reliable way anymore. So we are basically removing
  useless slab allocator overhead.
- Large kmallocs yield page-aligned objects, which is what
  SLAB did. Bad things like using page sized kmalloc allocations to
  stand in for page allocator allocs can be transparently handled and are not
  distinguishable from page allocator uses.
- Checking for too large objects can be removed since
  it is done by the page allocator.

Drawbacks:
- No accounting for large kmalloc slab allocations anymore
- No debugging of large kmalloc slab allocations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Fengguang Wu
57f6b96c09 filemap: convert some unsigned long to pgoff_t
Convert some 'unsigned long' to pgoff_t.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Fengguang Wu
535443f515 readahead: remove several readahead macros
Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu
6df8ba4f8a radixtree: introduce radix_tree_next_hole()
Introduce radix_tree_next_hole(root, index, max_scan) to scan radix tree for
the first hole.  It will be used in interleaved readahead.

The implementation is dumb and obviously correct.  It can help debug (and
document) a possible smarter one in the future.
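
A minimal sketch of such a "dumb" scan, assuming a plain array of slots
stands in for the radix tree.  next_hole(), slots and nr_slots are
hypothetical names for illustration, not the kernel interface:

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the radix tree: a NULL slot is a hole, and
 * anything past the end of the array also counts as empty.
 */
static unsigned long next_hole(void *const *slots, unsigned long nr_slots,
                               unsigned long index, unsigned long max_scan)
{
    unsigned long i;

    for (i = 0; i < max_scan; i++) {
        if (index >= nr_slots || slots[index] == NULL)
            break;          /* found the first hole */
        index++;
        if (index == 0)     /* index wrapped around: give up */
            break;
    }
    /* Either the hole, or the first index beyond the scanned range. */
    return index;
}
```

If no hole is found within max_scan slots, the returned index lies just
past the scanned range, so a caller can tell the two cases apart by
comparing against the start index plus max_scan.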

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu
f4e6b498d6 readahead: combine file_ra_state.prev_index/prev_offset into prev_pos
Combine the file_ra_state members
				unsigned long prev_index
				unsigned int prev_offset
into
				loff_t prev_pos

It is more consistent and better supports huge files.
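
A rough sketch of how the old pair can be recovered from a single byte
position, assuming 4 KiB pages.  The sketch_ names and helper functions
are illustrative only and are not taken from the patch:

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                    /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

typedef int64_t sketch_loff_t;                  /* stand-in for loff_t */

/* Combine a page index and an in-page offset into one byte position. */
static sketch_loff_t make_prev_pos(unsigned long index, unsigned int offset)
{
    /* Widen before shifting so a large index cannot overflow a 32-bit long. */
    return ((sketch_loff_t)index << SKETCH_PAGE_SHIFT) | offset;
}

/* Page index of the previous read. */
static unsigned long prev_index(sketch_loff_t pos)
{
    return (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
}

/* Offset within that page. */
static unsigned int prev_offset(sketch_loff_t pos)
{
    return (unsigned int)(pos & (SKETCH_PAGE_SIZE - 1));
}
```

The cast in make_prev_pos() is the kind of widening needed to avoid
shifting a 32-bit value past its width.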

Thanks to Peter for the nice proposal!

[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu
0bb7ba6b9c readahead: mmap read-around simplification
Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu
937085aa35 readahead: compacting file_ra_state
Use 'unsigned int' instead of 'unsigned long' for readahead sizes.

This helps reduce memory consumption on 64bit CPU when a lot of files are
opened.

CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Jesper Juhl
39e91e4331 Clean up duplicate includes in include/linux/memory_hotplug.h
This patch cleans up duplicate includes in
	include/linux/memory_hotplug.h

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Andy Whitcroft
d29eff7bca ppc64: SPARSEMEM_VMEMMAP support
Enable virtual memmap support for SPARSEMEM on PPC64 systems.  Slice a 16th
off the end of the linear mapping space and use that to hold the vmemmap.
Uses the same size mapping as is used in the linear 1:1 kernel mapping.

[pbadari@gmail.com: fix warning]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
David Miller
46644c2477 SPARC64: SPARSEMEM_VMEMMAP support
[apw@shadowen.org: style fixups]
[apw@shadowen.org: vmemmap sparc64: convert to new config options]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Christoph Lameter
ef229c5a5e IA64: SPARSEMEM_VMEMMAP 16K page size support
Equip IA64 sparsemem with a virtual memmap.  This is similar to the existing
CONFIG_VIRTUAL_MEM_MAP functionality for DISCONTIGMEM.  It uses a PAGE_SIZE
mapping.

This is provided as a minimally intrusive solution.  We split the 128TB
VMALLOC area into two 64TB areas and use one for the virtual memmap.

This should replace CONFIG_VIRTUAL_MEM_MAP long term.

[apw@shadowen.org: convert to new helper based initialisation]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Christoph Lameter
0889eba5b3 x86_64: SPARSEMEM_VMEMMAP 2M page size support
x86_64 uses 2M page table entries to map its 1-1 kernel space.  We also
implement the virtual memmap using 2M page table entries.  So there is no
additional runtime overhead over FLATMEM, initialisation is slightly more
complex.  As FLATMEM still references memory to obtain the mem_map pointer and
SPARSEMEM_VMEMMAP uses a compile time constant, SPARSEMEM_VMEMMAP should be
superior.

With this SPARSEMEM becomes the most efficient way of handling virt_to_page,
pfn_to_page and friends for UP, SMP and NUMA on x86_64.

[apw@shadowen.org: code resplit, style fixups]
[apw@shadowen.org: vmemmap x86_64: ensure end of section memmap is initialised]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Andy Whitcroft
29c71111d0 vmemmap: generify initialisation via helpers
Convert the common vmemmap population into initialisation helpers for use by
architecture vmemmap populators.  All architectures implementing the
SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate()
initialiser, which may make use of the helpers.

This allows us to clean up and remove the initialisation Kconfig entries.
With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
indicate use of that variant.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Christoph Lameter
8f6aac419b Generic Virtual Memmap support for SPARSEMEM
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
the arches.  It would be great if it could be the default so that we can get
rid of various forms of DISCONTIG and other variations on memory maps.  So far
what has hindered this are the additional lookups that SPARSEMEM introduces
for virt_to_page and page_address.  This goes so far that the code to do this
has to be kept in a separate function and cannot be used inline.

This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
is mapped into a virtually contiguous area and only the active sections are
physically backed.  This allows virt_to_page, page_address and cohorts to
become simple shift/add operations.  No page flag fields, no table lookups,
nothing involving memory is required.

The two key operations pfn_to_page and page_to_pfn become:

   #define __pfn_to_page(pfn)      (vmemmap + (pfn))
   #define __page_to_pfn(page)     ((page) - vmemmap)

By having a virtual mapping for the memmap we allow simple access without
wasting physical memory.  As kernel memory is typically already mapped 1:1
this introduces no additional overhead.  The virtual mapping must be big
enough to allow a struct page to be allocated and mapped for all valid
physical pages.  This will make a virtual memmap difficult to use on 32 bit
platforms that support 36 address bits.

However, if there is enough virtual space available and the arch already maps
its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
FLATMEM needs to read the contents of the mem_map variable to get the start of
the memmap and then add the offset to the required entry.  vmemmap is a
constant to which we can simply add the offset.
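
The difference can be modelled in userspace as follows.  This is only a
sketch: the static array, the sketch_ names and the simplified struct
page are assumptions made so that the example compiles on its own, and
the FLATMEM variant assumes the memmap starts at pfn 0.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct page; only its size matters here. */
struct page { unsigned long flags; };

#define SKETCH_NR_PAGES 1024            /* assumption: tiny fake machine */
static struct page sketch_memmap[SKETCH_NR_PAGES];

/* vmemmap model: the base is a constant address, no variable to load. */
#define sketch_vmemmap (&sketch_memmap[0])

static inline struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
    return sketch_vmemmap + pfn;
}

static inline unsigned long vmemmap_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - sketch_vmemmap);
}

/* FLATMEM model: the base lives in a variable that must be read first. */
static struct page *mem_map = sketch_memmap;

static inline struct page *flatmem_pfn_to_page(unsigned long pfn)
{
    return mem_map + pfn;               /* assumes memmap starts at pfn 0 */
}
```

Both variants yield the same pointer; the vmemmap form simply avoids the
extra load of mem_map.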

This patch has the potential to allow us to make SPARSEMEM the default (and
even the only) option for most systems.  It should be optimal on UP, SMP and
NUMA on most platforms.  Then we may even be able to remove the other memory
models: FLATMEM, DISCONTIG etc.

[apw@shadowen.org: config cleanups, resplit code etc]
[kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
[apw@shadowen.org: vmemmap: remove excess debugging]
[apw@shadowen.org: simplify initialisation code and reduce duplication]
[apw@shadowen.org: pull out the vmemmap code into its own file]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Andy Whitcroft
540557b943 sparsemem: record when a section has a valid mem_map
We have flags to indicate whether a section actually has a valid mem_map
associated with it.  This is never set and we rely solely on the present bit
to indicate a section is valid.  By definition a section is not valid if it
has no mem_map and there is a window during init where the present bit is set
but there is no mem_map, during which pfn_valid() will return true
incorrectly.

Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid
mem_map.  Switch valid_section{,_nr} and pfn_valid() to this bit.  Add a new
present_section{,_nr} and pfn_present() interfaces for those users who care to
know that a section is going to be valid.
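
The distinction between the two checks can be sketched as below.  The
bit positions and the reduced struct mem_section are assumptions for
illustration; only the flag and helper names follow the description
above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit positions, for illustration only. */
#define SECTION_MARKED_PRESENT (1UL << 0)
#define SECTION_HAS_MEM_MAP    (1UL << 1)

/* Reduced stand-in: the flags share a word with the encoded mem_map. */
struct mem_section { unsigned long section_mem_map; };

/* The section exists and will (eventually) have a mem_map. */
static bool present_section(const struct mem_section *s)
{
    return s && (s->section_mem_map & SECTION_MARKED_PRESENT);
}

/* The section has a usable mem_map right now. */
static bool valid_section(const struct mem_section *s)
{
    return s && (s->section_mem_map & SECTION_HAS_MEM_MAP);
}
```

During the init window described above, a section is present but not yet
valid, so a pfn_valid() built on valid_section() no longer returns true
too early.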

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Christoph Hellwig
74a0b57627 x86: optimize page faults like all other architectures and kill notifier cruft
x86(-64) are the last architectures still using the page fault notifier
cruft for the kprobes page fault hook.  This patch converts them to the
proper direct calls, and removes the now unused pagefault notifier bits
as well as the cruft in kprobes.c that was related to this mess.

I know Andi didn't really like this, but all other architecture maintainers
agreed the direct calls are much better and, besides the obvious cruft
removal, a common way of dealing with kprobes across architectures is
important as well.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc64]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andi Kleen <ak@suse.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Mike Travis
d5a7430ddc Convert cpu_sibling_map to be a per cpu variable
Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
variable.  This saves sizeof(cpumask_t) for each unused cpu.  Access is mostly
from startup and CPU HOTPLUG functions.
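
As a back-of-the-envelope sketch of the saving, assuming a cpumask is a
bitmap of NR_CPUS bits rounded up to whole longs.  The helper names and
the example figures are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (8 * sizeof(unsigned long))

/* Size of one cpumask: nr_cpus bits rounded up to whole longs. */
static size_t cpumask_bytes(size_t nr_cpus)
{
    size_t longs = (nr_cpus + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

    return longs * sizeof(unsigned long);
}

/* Bytes no longer allocated for cpus that can never come online. */
static size_t sibling_map_bytes_saved(size_t nr_cpus, size_t possible_cpus)
{
    assert(possible_cpus <= nr_cpus);
    return cpumask_bytes(nr_cpus) * (nr_cpus - possible_cpus);
}
```

For example, with NR_CPUS configured as 1024 and 8 possible cpus, each
mask is 128 bytes and 1016 of them are no longer needed.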

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Mike Travis
0835761129 x86: Convert cpu_core_map to be a per cpu variable
This is from an earlier message from 'Christoph Lameter':

    cpu_core_map is currently an array defined using NR_CPUS. This means that
    we overallocate since we will rarely really use maximum configured cpu.

    If we put the cpu_core_map into the per cpu area then it will be allocated
    for each processor as it comes online.

    This means that the core map cannot be accessed until the per cpu area
    has been allocated. Xen does a weird thing here looping over all processors
    and zeroing the masks that are not yet allocated and that will be zeroed
    when they are allocated. I commented the code out.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Guennadi Liakhovetski
b3b708fa27 wake up from a serial port
Enable wakeup from serial ports, make it run-time configurable over sysfs,
e.g.,

echo enabled > /sys/devices/platform/serial8250.0/tty/ttyS0/power/wakeup

Requires

# CONFIG_SYSFS_DEPRECATED is not set

Following suggestions from Alan and Russell moved the may_wake_up checks
to serial_core.c. This time actually tested - it does even work. Could
someone, please, verify, that put_device after device_find_child is
correct?

Also would be nice to test with a Natsemi UART, that can wake up the system,
if such systems exist.

For this you just have to apply the patch below, issue the above "echo"
command to one of your Natsemi ports, suspend and resume your system, and
verify that your Natsemi port still works.  If you are actually capable of
waking up the system from that port, would be nice to test that as well.

Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Guennadi Liakhovetski
aa5346a212 provide stubs for enable_irq_wake() and disable_irq_wake()
Provide {enable,disable}_irq_wake dummies for undefined
cross-compilers for platforms without CONFIG_GENERIC_IRQ.

Needed by wake-up-from-a-serial-port.patch
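
The stub pattern looks roughly like this.  SKETCH_GENERIC_IRQ is a
hypothetical stand-in for the real config symbol, and the bodies are an
assumption about what a no-op stub would return:

```c
/* SKETCH_GENERIC_IRQ is a hypothetical stand-in for the config symbol. */
#ifndef SKETCH_GENERIC_IRQ

/* No wakeup support on this platform: accept the call and do nothing. */
static inline int enable_irq_wake(unsigned int irq)
{
    (void)irq;
    return 0;
}

static inline int disable_irq_wake(unsigned int irq)
{
    (void)irq;
    return 0;
}

#endif /* !SKETCH_GENERIC_IRQ */
```

Callers such as the serial core can then use both functions
unconditionally, without their own #ifdef blocks.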

Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Alan Cox
bf0df636e5 8250_pci: Autodetect mainpine cards
Add support for a whole range of boards. Some are partly autodetected but
not fully correctly; others (PCI Express notably) not at all. Stick all
the right entries in.

Thanks to Mainpine for information and testing.

Signed-off-by: Alan Cox <alan@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
James Bottomley
43d9f7fda1 pcmcia: use DMA_MASK_NONE for the default for all pcmcia devices
Most non cardbus devices can't do dma, so flag them as such in the device
creation routine.

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Natalie Protasevich <protasnb@gmail.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
James Bottomley
32e8f70230 introduce DMA_MASK_NONE as a signal for unable to do DMA
Some devices are incapable of DMA and need to be recognised as such.
Introduce a NONE dma mask to facilitate this plus an inline function:
is_device_dma_capable() to check this.
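
A minimal sketch of the idea, assuming the NONE mask is simply zero and
using a reduced stand-in for struct device.  Both are assumptions for
illustration, not a copy of the patch:

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_MASK_NONE 0x0ULL            /* assumed value: no address bits */

/* Reduced stand-in for struct device; only the mask pointer is modelled. */
struct sketch_device { uint64_t *dma_mask; };

/* A device can do DMA only if it has a mask and the mask is not NONE. */
static inline int is_device_dma_capable(const struct sketch_device *dev)
{
    return dev->dma_mask != NULL && *dev->dma_mask != DMA_MASK_NONE;
}
```

A bus driver would set the mask to DMA_MASK_NONE at device creation, and
DMA mapping code could bail out early when the check fails.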

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Natalie Protasevich <protasnb@gmail.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Yoichi Yuasa
b5446b514c move a few definitions to au1000_xxs1500.c
Only a few definitions are in xxs1500.h.
They can be moved to au1000_xxs1500.c.

[m.kozlowski@tuxland.pl: fix unbalanced parenthesis]
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Ralf Baechle
0322a2b840 Add assembler equivalents to __init{,data}_refok
I need __INIT_REFOK to fix a MODPOST warning for a few MIPS configs which
have to call init code from .text very early in the game due to bootloader
issues.  __INITDATA_REFOK is just for consistency.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:49 -07:00
Randy Dunlap
bfe8df3d31 slow down printk during boot
Optionally add a boot delay after each kernel printk() call, crudely
measured in milliseconds, with a maximum delay of 10 seconds per printk.

Enable CONFIG_BOOT_PRINTK_DELAY=y and then add (e.g.):
"lpj=loops_per_jiffy boot_delay=100"
to the kernel command line.

It has been useful in cases like "during boot, my machine just reboots or the
screen goes black".  By slowing down printk (and adding initcall_debug), we can
usually see the last thing that happened before the lights went out, which is
usually a valuable clue.

[akpm@linux-foundation.org: not all architectures implement CONFIG_HZ]
[akpm@linux-foundation.org: fix lots of stuff]
[bunk@stusta.de: kernel/printk.c: make 2 variables static]
[heiko.carstens@de.ibm.com: fix slow down printk on boot compile error]
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:49 -07:00
Randy Dunlap
e6716b87d5 docbook: fix filesystems content
Fix filesystems docbook warnings.

Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'name'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'mode'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'parent'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'value'
Warning(linux-2.6.23-git8//include/linux/jbd.h:404): No description found for parameter 'h_lockdep_map'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-15 17:56:36 -07:00
Randy Dunlap
fd39c86b3d docbook: fix usb content
Fix USB docbook warnings.

Warning(linux-2.6.23-git8//include/linux/usb/gadget.h:487): No description found for parameter 'g'
Warning(linux-2.6.23-git8//include/linux/usb/gadget.h:506): No description found for parameter 'g'

Warning(linux-2.6.23-git8//drivers/usb/core/hub.c:1416): No description found for parameter 'usb_dev'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-15 17:56:36 -07:00
Linus Torvalds
65a6ec0d72 Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (95 commits)
  [ARM] 4578/1: CM-x270: PCMCIA support
  [ARM] 4577/1: ITE 8152 PCI bridge support
  [ARM] 4576/1: CM-X270 machine support
  [ARM] pxa: Avoid pxa_gpio_mode() in gpio_direction_{in,out}put()
  [ARM] pxa: move pxa_set_mode() from pxa2xx_mainstone.c to mainstone.c
  [ARM] pxa: move pxa_set_mode() from pxa2xx_lubbock.c to lubbock.c
  [ARM] pxa: Make cpu_is_pxaXXX dependent on configuration symbols
  [ARM] pxa: PXA3xx base support
  [NET] smc91x: fix PXA DMA support code
  [SERIAL] Fix console initialisation ordering
  [ARM] pxa: tidy up arch/arm/mach-pxa/Makefile
  [ARM] Update arch/arm/Kconfig for drivers/Kconfig changes
  [ARM] 4600/1: fix kernel build failure with build-id-supporting binutils
  [ARM] 4599/1: Preserve ATAG list for use with kexec (2.6.23)
  [ARM] Rename consistent_sync() as dma_cache_maint()
  [ARM] 4572/1: ep93xx: add cirrus logic edb9307 support
  [ARM] 4596/1: S3C2412: Correct IRQs for SDI+CF and add decoding support
  [ARM] 4595/1: ns9xxx: define registers as void __iomem * instead of volatile u32
  [ARM] 4594/1: ns9xxx: use the new gpio functions
  [ARM] 4593/1: ns9xxx: implement generic clockevents
  ...
2007-10-15 16:08:50 -07:00
Linus Torvalds
541010e4b8 Merge branch 'locks' of git://linux-nfs.org/~bfields/linux
* 'locks' of git://linux-nfs.org/~bfields/linux:
  nfsd: remove IS_ISMNDLCK macro
  Rework /proc/locks via seq_files and seq_list helpers
  fs/locks.c: use list_for_each_entry() instead of list_for_each()
  NFS: clean up explicit check for mandatory locks
  AFS: clean up explicit check for mandatory locks
  9PFS: clean up explicit check for mandatory locks
  GFS2: clean up explicit check for mandatory locks
  Cleanup macros for distinguishing mandatory locks
  Documentation: move locks.txt in filesystems/
  locks: add warning about mandatory locking races
  Documentation: move mandatory locking documentation to filesystems/
  locks: Fix potential OOPS in generic_setlease()
  Use list_first_entry in locks_wake_up_blocks
  locks: fix flock_lock_file() comment
  Memory shortage can result in inconsistent flocks state
  locks: kill redundant local variable
  locks: reverse order of posix_locks_conflict() arguments
2007-10-15 16:07:40 -07:00
Linus Torvalds
e457f790d8 Merge branch 'release' of ssh://master.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of ssh://master.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] build fix for scatterlist
2007-10-15 15:32:57 -07:00