The zone table is mostly not needed. If we have a node in the page flags
then we can get to the zone via NODE_DATA() which is much more likely to be
already in the cpu cache.
In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
access an exact replica of zonetable in the node_zones field. In all of
the above cases there will be no need at all for the zone table.
The only remaining case is if in a NUMA system the node numbers do not fit
into the page flags. In that case we make sparse generate a table that
maps sections to nodes and use that table to to figure out the node number.
This table is sized to fit in a single cache line for the known 32 bit
NUMA platform which makes it very likely that the information can be
obtained without a cache miss.
For sparsemem the zone table seems to be have been fairly large based on
the maximum possible number of sections and the number of zones per node.
There is some memory saving by removing zone_table. The main benefit is to
reduce the cache foootprint of the VM from the frequent lookups of zones.
Plus it simplifies the page allocator.
[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- s/freeliest/freelist/ spelling fix
- Check for NULL *z zone seems useless - even if it could happen, so
what? Perhaps we should have a check later on if we are faced with an
allocation request that is not allowed to fail - shouldn't that be a
serious kernel error, passing an empty zonelist with a mandate to not
fail?
- Initializing 'z' to zonelist->zones can wait until after the first
get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
loop, so let's initialize 'z' there, in a 'for' loop. Seems clearer.
- Remove superfluous braces around a break
- Fix a couple errant spaces
- Adjust indentation on the cpuset_zone_allowed() check, to match the
lines just before it -- seems easier to read in this case.
- Add another set of braces to the zone_watermark_ok logic
From: Paul Jackson <pj@sgi.com>
Backout one item from a previous "memory page_alloc minor cleanups" patch.
Until and unless we are certain that no one can ever pass an empty zonelist
to __alloc_pages(), this check for an empty zonelist (or some BUG
equivalent) is essential. The code in get_page_from_freelist() blow ups if
passed an empty zonelist.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.infradead.org/users/dhowells/workq-2.6:
Actually update the fixed up compile failures.
WorkQueue: Fix up arch-specific work items where possible
WorkStruct: make allyesconfig
WorkStruct: Pass the work_struct pointer instead of context data
WorkStruct: Merge the pending bit into the wq_data pointer
WorkStruct: Typedef the work function prototype
WorkStruct: Separate delayable and non-delayable events.
I was playing with blackfin when i hit a neat bug ... doing an open() on a
directory and then passing that fd to mmap() would cause the kernel to hang
after poking into the code a bit more, i found that
mm/nommu.c:validate_mmap_request() checks the length and if it is 0, just
returns the address ... this is in stark contrast to mmu's
mm/mmap.c:do_mmap_pgoff() where it returns -EINVAL for 0 length requests ...
i then noticed that some other parts of the logic is out of date between the
two funcs, so perhaps that's the easy fix ?
Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conflicts:
drivers/ata/libata-scsi.c
include/linux/libata.h
Futher merge of Linus's head and compilation fixups.
Signed-Off-By: David Howells <dhowells@redhat.com>
Conflicts:
drivers/infiniband/core/iwcm.c
drivers/net/chelsio/cxgb2.c
drivers/net/wireless/bcm43xx/bcm43xx_main.c
drivers/net/wireless/prism54/islpci_eth.c
drivers/usb/core/hub.h
drivers/usb/input/hid-core.c
net/core/netpoll.c
Fix up merge failures with Linus's head and fix new compilation failures.
Signed-Off-By: David Howells <dhowells@redhat.com>
find_min_pfn_for_node() and find_min_pfn_with_active_regions() both
depend on a sorted early_node_map[]. However, sort_node_map() is being
called after fin_min_pfn_with_active_regions() in
free_area_init_nodes().
In most cases, this is ok, but on at least one x86_64, the SRAT table
caused the E820 ranges to be registered out of order. This gave the
wrong values for the min PFN range resulting in some pages not being
initialised.
This patch sorts the early_node_map in find_min_pfn_for_node(). It has
been boot tested on x86, x86_64, ppc64 and ia64.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.
For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.
To make this work, an extra flag is introduced into the management side of the
work_struct. This governs auto-release of the structure upon execution.
Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function. This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated.. This is a
problem if the auxiliary data is taken away (as done by the last patch).
However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems. But then the work function must itself release the
work_struct by calling work_release().
In most cases, automatic release is fine, so this is the default. Special
initiators exist for the non-auto-release case (ending in _NAR).
Signed-Off-By: David Howells <dhowells@redhat.com>
Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.
The work_struct struct is huge, and this limits it's usefulness. On a 64-bit
architecture it's nearly 100 bytes in size. This reduces that by half for the
non-delayable type of event.
Signed-Off-By: David Howells <dhowells@redhat.com>
Recently, __get_vm_area_node() was changed like following
if (unlikely(!area))
return NULL;
- if (unlikely(!size)) {
- kfree (area);
+ if (unlikely(!size))
return NULL;
- }
It is leaking `area', also original code seems strange already.
Probably, we wanted to do this patch.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commit cb07c9a186 causes the wrong return
value. is_hugepage_only_range() is a boolean, so we should return
-EINVAL rather than 1.
Also - we can use "mm" instead of looking up "current->mm" again.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Unlike mmap(), the codepath for brk() creates a vma without first checking
that it doesn't touch a region exclusively reserved for hugepages. On
powerpc, this can allow it to create a normal page vma in a hugepage
region, causing oopses and other badness.
Add a test to prevent this. With this patch, brk() will simply fail if it
attempts to move the break into a hugepage reserved region.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
(David:)
If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
because the given file offset is not hugepage aligned - then do_mmap_pgoff
will go to the unmap_and_free_vma backout path.
But at this stage the vma hasn't been marked as hugepage, and the backout path
will call unmap_region() on it. That will eventually call down to the
non-hugepage version of unmap_page_range(). On ppc64, at least, that will
cause serious problems if there are any existing hugepage pagetable entries in
the vicinity - for example if there are any other hugepage mappings under the
same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
entries. I suspect this will also cause bad problems on ia64, though I don't
have a machine to test it on.
(Hugh:)
prepare_hugepage_range() should check file offset alignment when it checks
virtual address and length, to stop MAP_FIXED with a bad huge offset from
unmapping before it fails further down. PowerPC should apply the same
prepare_hugepage_range alignment checks as ia64 and all the others do.
Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
is the check for too small a mapping); but even so, move up setting of
VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
when unwinding from error will go the non-huge way, which may cause bad
behaviour on architectures (powerpc and ia64) which segregate their huge
mappings into a separate region of the address space.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- reorder 'struct vm_struct' to speedup lookups on CPUS with small cache
lines. The fields 'next,addr,size' should be now in the same cache line,
to speedup lookups.
- One minor cleanup in __get_vm_area_node()
- Bugfixes in vmalloc_user() and vmalloc_32_user() NULL returns from
__vmalloc() and __find_vm_area() were not tested.
[akpm@osdl.org: remove redundant BUG_ONs]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_move_pages() uses vmalloc() to allocate an array of structures that is
fills with information passed from user mode and then passes to
do_stat_pages() (in the case the node list is NULL). do_stat_pages()
depends on a marker in the node field of the structure to decide how large
the array is and this marker is correctly inserted into the last element of
the array. However, vmalloc() doesn't zero the memory it allocates and if
the user passes NULL for the node list, then the node fields are not filled
in (except for the end marker). If the memory the vmalloc() returned
happend to have a word with the marker value in it in just the right place,
do_pages_stat will fail to fill the status field of part of the array and
we will return (random) kernel data to user mode.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It looks like there is a bug in init_reap_node() in slab.c that can cause
multiple oops's on certain ES7000 configurations. The variable reap_node
is defined per cpu, but only initialized on a single CPU. This causes an
oops in next_reap_node() when __get_cpu_var(reap_node) returns the wrong
value. Fix is below.
Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Cc: Andi Kleen <ak@suse.de>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Current read_pages() assume ->readpages() frees the passed pages.
This patch free the pages in ->read_pages(), if those were remaining in the
pages_list. So, readpages() just can ignore the remaining pages in
pages_list.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Un-needed add-store operation wastes a few bytes.
8 bytes wasted with -O2, on a ppc.
Signed-off-by: nkalmala <nkalmala@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As reported by Martin J. Bligh <mbligh@google.com>, we let through some
non-slab bits to slab allocation through __get_vm_area_node when doing a
vmalloc.
I haven't been able to reproduce this, although I understand why it
happens: vmalloc allocates memory with
GFP_KERNEL | __GFP_HIGHMEM
and commit 52fd24ca1d resulted in the same
flags are passed down to cache_alloc_refill, causing the BUG. The
following patch fixes it.
Note that when calling kmalloc_node, I am masking off __GFP_HIGHMEM with
GFP_LEVEL_MASK, whereas __vmalloc_area_node does the same with
~(__GFP_HIGHMEM | __GFP_ZERO).
IMHO, using GFP_LEVEL_MASK is preferable, but either should fix this
problem.
Signed-off-by: Giridhar Pemmasani (pgiri@yahoo.com)
Cc: Martin J. Bligh <mbligh@google.com>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
absent_pages_in_range() made the assumption that users of the
arch-independent zone-sizing API would not care about holes beyound the end
of physical memory. This was not the case and was "fixed" in a patch
called "Account for holes that are outside the range of physical memory".
However, when given a range that started before a hole in "real" memory and
ended beyond the end of memory, it would get the result wrong. The bug is
in mainline but a patch is below.
It has been tested successfully on a number of machines and architectures.
Additional credit to Keith Mannthey for discovering the problem, helping
identify the correct fix and confirming it Worked For Him.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: keith mannthey <kmannth@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative". Reinstate my
preliminary i_size check before attempting to allocate the page (though this
only fixes the most obvious case: more work will be needed here).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If __vmalloc is called to allocate memory with GFP_ATOMIC in atomic
context, the chain of calls results in __get_vm_area_node allocating memory
for vm_struct with GFP_KERNEL, causing the 'sleeping from invalid context'
warning. This patch fixes it by passing the gfp flags along so
__get_vm_area_node allocates memory for vm_struct with the same flags.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add __GFP_NOWARN flag to calling of __alloc_pages() in
__kmalloc_section_memmap(). It can reduce noisy failure message.
In ia64, section size is 1 GB, this means that order 8 pages are necessary
for each section's memmap. It is often very hard requirement under heavy
memory pressure as you know. So, __alloc_pages() gives up allocation and
shows many noisy stack traces which means no page for each sections.
(Current my environment shows 32 times of stack trace....)
But, __kmalloc_section_memmap() calls vmalloc() after failure of it, and it
can succeed allocation of memmap. So, its stack trace warning becomes just
noisy. I suppose it shouldn't be shown.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If try_to_free_pages / balance_pgdat are called with a gfp_mask specifying
GFP_IO and/or GFP_FS, they will reclaim the requisite number of pages, and the
reset prev_priority to DEF_PRIORITY (or to some other high (ie: unurgent)
value).
However, another reclaimer without those gfp_mask flags set (say, GFP_NOIO)
may still be struggling to reclaim pages. The concurrent overwrite of
zone->prev_priority will cause this GFP_NOIO thread to unexpectedly cease
deactivating mapped pages, thus causing reclaim difficulties.
Fix this is to key the distress calculation not off zone->prev_priority, but
also take into account the local caller's priority by using
min(zone->prev_priority, sc->priority)
Signed-off-by: Martin J. Bligh <mbligh@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The temp_priority field in zone is racy, as we can walk through a reclaim
path, and just before we copy it into prev_priority, it can be overwritten
(say with DEF_PRIORITY) by another reclaimer.
The same bug is contained in both try_to_free_pages and balance_pgdat, but
it is fixed slightly differently. In balance_pgdat, we keep a separate
priority record per zone in a local array. In try_to_free_pages there is
no need to do this, as the priority level is the same for all zones that we
reclaim from.
Impact of this bug is that temp_priority is copied into prev_priority, and
setting this artificially high causes reclaimers to set distress
artificially low. They then fail to reclaim mapped pages, when they are,
in fact, under severe memory pressure (their priority may be as low as 0).
This causes the OOM killer to fire incorrectly.
From: Andrew Morton <akpm@osdl.org>
__zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
is used in the decision whether or not to bring mapped pages onto the inactive
list. Hence there's a risk here that __zone_reclaim() will fail because
zone->prev_priority ir large (ie: low urgency) and lots of mapped pages end up
stuck on the active list.
Fix that up by decreasing (ie making more urgent) zone->prev_priority as
__zone_reclaim() scans the zone's pages.
This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
possible to remove that now, and to just start out at DEF_PRIORITY?
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Consolidate page_cache_alloc
- Fix splice: only the pagecache pages and filesystem data need to use
mapping_gfp_mask.
- Fix grab_cache_page_nowait: same as splice, also honour NUMA placement.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The zonelist may contain zones of nodes that have not been bootstrapped and
we will oops if we try to allocate from those zones. So check if the node
information for the slab and the node have been setup before attempting an
allocation. If it has not been setup then skip that zone.
Usually we will not encounter this situation since the slab bootstrap code
avoids falling back before we have setup the respective nodes but we seem
to have a special needs for pppc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reintroduce NODES_SPAN_OTHER_NODES for powerpc
Revert "[PATCH] Remove SPAN_OTHER_NODES config definition"
This reverts commit f62859bb68.
Revert "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
This reverts commit a94b3ab7ea.
Also update the comments to indicate that this is still required
and where its used.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
[PATCH] Remove SUID when splicing into an inode
[PATCH] Add lockless helpers for remove_suid()
[PATCH] Introduce generic_file_splice_write_nolock()
[PATCH] Take i_mutex in splice_from_pipe()
--=-=-=
from mm/memory.c:
1434 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va)
1435 {
1436 /*
1437 * If the source page was a PFN mapping, we don't have
1438 * a "struct page" for it. We do a best-effort copy by
1439 * just copying from the original user address. If that
1440 * fails, we just zero-fill it. Live with it.
1441 */
1442 if (unlikely(!src)) {
1443 void *kaddr = kmap_atomic(dst, KM_USER0);
1444 void __user *uaddr = (void __user *)(va & PAGE_MASK);
1445
1446 /*
1447 * This really shouldn't fail, because the page is there
1448 * in the page tables. But it might just be unreadable,
1449 * in which case we just give up and fill the result with
1450 * zeroes.
1451 */
1452 if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
1453 memset(kaddr, 0, PAGE_SIZE);
1454 kunmap_atomic(kaddr, KM_USER0);
#### D-cache have to be flushed here.
#### It seems it is just forgotten.
1455 return;
1456
1457 }
1458 copy_user_highpage(dst, src, va);
#### Ok here. flush_dcache_page() called from this func if arch need it
1459 }
Following is the patch fix this issue:
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Qooting Adrian:
- net/sunrpc/svc.c uses highest_possible_node_id()
- include/linux/nodemask.h says highest_possible_node_id() is
out-of-line #if MAX_NUMNODES > 1
- the out-of-line highest_possible_node_id() is in lib/cpumask.c
- lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y
-> highest_possible_node_id() is used in net/sunrpc/svc.c
CONFIG_NODES_SHIFT defined and > 0
-> include/linux/numa.h: MAX_NUMNODES > 1
-> compile error
The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
depends on NUMA (but m32r isn't the only affected architecture).
So move the function into page_alloc.c
Cc: Adrian Bunk <bunk@stusta.de>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Despite mm.h is not being exported header, it does contain one thing
which is part of userspace ABI -- value disabling OOM killer for given
process. So,
a) create and export include/linux/oom.h
b) move OOM_DISABLE define there.
c) turn bounding values of /proc/$PID/oom_adj into defines and export
them too.
Note: mass __KERNEL__ removal will be done later.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.
The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.
This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.
Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When direct-io falls back to buffered write, it will just leave the dirty data
floating about in pagecache, pending regular writeback.
But normal direct-io semantics are that IO is synchronous, and that it leaves
no pagecache behind.
So change the fallback-to-buffered-write code to sync the file region and to
then strip away the pagecache, just as a regular direct-io write would do.
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Right now users have to grab i_mutex before calling remove_suid(), in the
unlikely event that a call to ->setattr() may be needed. Split up the
function in two parts:
- One to check if we need to remove suid
- One to actually remove it
The first we can call lockless.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
A recent change to the vmalloc() code accidentally resulted in us passing
__GFP_ZERO into the slab allocator. But we only wanted __GFP_ZERO for the
actual pages whcih are being vmalloc()ed, and passing __GFP_ZERO into slab is
not a rational thing to ask for.
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We need to encode a decode the 'file' part of a handle. We simply use the
inode number and generation number to construct the filehandle.
The generation number is the time when the file was created. As inode numbers
cycle through the full 32 bits before being reused, there is no real chance of
the same inum being allocated to different files in the same second so this is
suitably unique. Using time-of-day rather than e.g. jiffies makes it less
likely that the same filehandle can be created after a reboot.
In order to be able to decode a filehandle we need to be able to lookup by
inum, which means that the inode needs to be added to the inode hash table
(tmpfs doesn't currently hash inodes as there is never a need to lookup by
inum). To avoid overhead when not exporting, we only hash an inode when it is
first exported. This requires a lock to ensure it isn't hashed twice.
This code is separate from the patch posted in June06 from Atal Shargorodsky
which provided the same functionality, but does borrow slightly from it.
Locking comment: Most filesystems that hash their inodes do so at the point
where the 'struct inode' is initialised, and that has suitable locking
(I_NEW). Here in shmem, we are hashing the inode later, the first time we
need an NFS file handle for it. We no longer have I_NEW to ensure only one
thread tries to add it to the hash table.
Cc: Atal Shargorodsky <atal@codefidence.com>
Cc: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: David M. Grimes <dgrimes@navisite.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If remove_mapping() failed to remove the page from its mapping, don't go and
mark it not uptodate! Makes kernel go dead.
(Actually, I don't think the ClearPageUptodate is needed there at all).
Says Nick Piggin:
"Right, it isn't needed because at this point the page is guaranteed
by remove_mapping to have no references (except us) and cannot pick
up any new ones because it is removed from pagecache.
We can delete it."
Signed-off-by: Andrew Morton <akpm@osdl.org>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
.. and clean up the file mapping code while at it. No point in having a
"if (file)" repeated twice, and generally doing similar checks in two
different sections of the same code
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If try_to_release_page() is called with a zero gfp mask, then the
filesystem is effectively denied the possibility of sleeping while
attempting to release the page. There doesn't appear to be any valid
reason why this should be banned, given that we're not calling this from a
memory allocation context.
For this reason, change the gfp_mask argument of the call to GFP_KERNEL.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Steve Dickson <SteveD@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A failure in invalidate_inode_pages2_range() can result in unpleasant things
happening in NFS (at least). Stick a WARN_ON_ONCE() in there so we can find
out if it happens, and maybe why.
(akpm: might be a -mm-only patch, we'll see..)
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Steve Dickson <SteveD@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the lock debug checks below the page reserved checks. Also, having
debug_check_no_locks_freed in kernel_map_pages is wrong.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
After the PG_reserved check was added, arch_free_page was being called in the
wrong place (it could be called for a page we don't actually want to free).
Fix that.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With CONFIG_MIGRATION=n
mm/mempolicy.c: In function 'do_mbind':
mm/mempolicy.c:796: warning: passing argument 2 of 'migrate_pages' from incompatible pointer type
Signed-off-by: Keith Owens <kaos@ocs.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We have a persistent dribble of reports of this BUG triggering. Its extended
diagnostics were recently made conditional on CONFIG_DEBUG_VM, which was a bad
idea - we want to know about it.
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>