Swap is duplicated (reference count incremented by one) whenever the same
swap page is inserted into another mm (when forking finds a swap entry in
place of a pte, or when reclaim unmaps a pte to insert the swap entry).
swap_info_struct's vmalloc'ed swap_map is the array of these reference
counts: but what happens when the unsigned short (or unsigned char since
the preceding patch) is full? (and its high bit is kept for a cache flag)
We then lose track of it, never freeing, leaving it in use until swapoff:
at which point we _hope_ that a single pass will have found all instances,
assume there are no more, and will lose user data if we're wrong.
Swapping of KSM pages has not yet been enabled; but it is implemented,
and makes it very easy for a user to overflow the maximum swap count:
possible with ordinary process pages, but unlikely, even when pid_max
has been raised from PID_MAX_DEFAULT.
This patch implements swap count continuations: when the count overflows,
a continuation page is allocated and linked to the original vmalloc'ed
map page, and this used to hold the continuation counts for that entry
and its neighbours. These continuation pages are seldom referenced:
the common paths all work on the original swap_map, only referring to
a continuation page when the low "digit" of a count is incremented or
decremented through SWAP_MAP_MAX.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
chars: it's still very unusual to reach a swap count of 126, and the
next patch allows it to be extended indefinitely.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Though swap_count() is useful, I'm finding that swap_has_cache() and
encode_swapmap() obscure what happens in the swap_map entry, just at
those points where I need to understand it. Remove them, and pass
more usable "usage" values to scan_swap_map(), swap_entry_free() and
__swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move CONFIG_HIBERNATION's swapdev_block() into the main CONFIG_HIBERNATION
block, remove extraneous whitespace and return, fix typo in a comment.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make better use of the space by folding first swap_extent into its
swap_info_struct, instead of just the list_head: swap partitions need
only that one, and for others it's used as a circular list anyway.
[jirislaby@gmail.com: fix crash on double swapon]
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
to reserve an array of about 30 of them in bss, when most people will
want only one. Change swap_info[] to an array of pointers.
That does need a "type" field in the structure: pack it as a char with
next type and short prio (aha, char is unsigned by default on PowerPC).
Use the (admittedly peculiar) name "type" throughout for this index.
/proc/swaps does not take swap_lock: I wouldn't want it to, but do take
care with barriers when adding a new item to the array (never removed).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The swap_info_struct is mostly private to mm/swapfile.c, with only
one other in-tree user: get_swap_bio(). Adjust its interface to
map_swap_page(), so that we can then remove get_swap_info_struct().
But there is a popular user out-of-tree, TuxOnIce: so leave the
declaration of swap_info_struct in linux/swap.h.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Nigel Cunningham <ncunningham@crca.org.au>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- avoid wasting more precious resources (DMA or DMA32 pools), when
being called through vmalloc_32{,_user}()
- explicitly allow using high memory here even if the outer allocation
request doesn't allow it
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Objects passed to NODEMASK_ALLOC() are relatively small in size and are
backed by slab caches that are not of large order, traditionally never
greater than PAGE_ALLOC_COSTLY_ORDER.
Thus, using GFP_KERNEL for these allocations on large machines when
CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
the allocation attempt, each time invoking both direct reclaim or the oom
killer.
This is of particular interest when using NODEMASK_ALLOC() from a
mempolicy context (either directly in mm/mempolicy.c or the mempolicy
constrained hugetlb allocations) since the oom killer always kills current
when allocations are constrained by mempolicies. So for all present use
cases in the kernel, current would end up being oom killed when direct
reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but
current would have sacrificed itself upon returning.
This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
All current use cases either directly from hugetlb code or indirectly via
NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom
killer when the slab allocator needs to allocate additional pages.
The side-effect of this change is that all current use cases of either
NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
when the allocation fails (never for CONFIG_NODES_SHIFT <= 8). All
current use cases were audited and do have appropriate error handling at
this time.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
there are no present pages left.
In such a situation, kswapd must also be stopped since it has nothing left
to do.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Register per node hstate sysfs attributes only for nodes with memory.
Global replacement of 'all online nodes" with "all nodes with memory" in
mm/hugetlb.c. Suggested by David Rientjes.
A subsequent patch will handle adding/removing of per node hstate sysfs
attributes when nodes transition to/from memoryless state via memory
hotplug.
NOTE: this patch has not been tested with memoryless nodes.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:
* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
See the updated documentation [next patch] for more information
about the implications of this patch.
Examples:
Starting with:
Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0
Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:
sysctl vm.nr_hugepages[_mempolicy]=32
yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8
Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.
Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:
numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
This yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.
Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:
numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
yields:
Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The 8 huge pages freed were balanced over nodes 0 and 1.
[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for constraining huge page allocation and freeing by the
controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
to the allocate, free and surplus adjustment functions. For now, pass
NULL to indicate default behavior--i.e., use node_online_map. A
subsqeuent patch will derive a non-default mask from the controlling
task's numa mempolicy.
Note that this method of updating the global hstate nr_hugepages under the
constraint of a nodemask simplifies keeping the global state
consistent--especially the number of persistent and surplus pages relative
to reservations and overcommit limits. There are undoubtedly other ways
to do this, but this works for both interfaces: mempolicy and per node
attributes.
[rientjes@google.com: fix HIGHMEM compile error]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Modify the hstate_next_node* functions to allow them to be called to
obtain the "start_nid". Then, whereas prior to this patch we
unconditionally called hstate_next_node_to_{alloc|free}(), whether or not
we successfully allocated/freed a huge page on the node, now we only call
these functions on failure to alloc/free to advance to next allowed node.
Factor out the next_node_allowed() function to handle wrap at end of
node_online_map. In this version, the allowed nodes include all of the
online nodes.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed
in right after isolate_page().
This patch does it.
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On ia64, the following test program exit abnormally, because glibc thread
library called abort().
========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================
The fact is, glibc call munmap() when thread exitng time for freeing
stack, and it assume munlock() never fail. However, munmap() often make
vma splitting and it with many mapcount make -ENOMEM.
Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary. internal temporary exceeding
shouldn't make ENOMEM.
This patch does it.
test_max_mapcount.c
==================================================================
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
#include<errno.h>
#include<unistd.h>
#define THREAD_NUM 30000
#define MAL_SIZE (8*1024*1024)
void *wait_thread(void *args)
{
void *addr;
addr = malloc(MAL_SIZE);
sleep(10);
return NULL;
}
void *wait_thread2(void *args)
{
sleep(60);
return NULL;
}
int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;
ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}
ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}
for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}
sleep(3600);
return 0;
}
==================================================================
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer header, including information such as the allocation order
and gfp mask, current's cpuset and memory controller, call trace, and VM
state information is currently only shown when the oom killer has selected
a task to kill.
This information is omitted, however, when the oom killer panics either
because of panic_on_oom sysctl settings or when no killable task was
found. It is still relevant to know crucial pieces of information such as
the allocation order and VM state when diagnosing such issues, especially
at boot.
This patch displays the oom killer header whenever it panics so that bug
reports can include pertinent information to debug the issue, if possible.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mce: Clean up thermal init by introducing intel_thermal_supported()
x86, mce: Thermal monitoring depends on APIC being enabled
x86: Gart: fix breakage due to IOMMU initialization cleanup
x86: Move swiotlb initialization before dma32_free_bootmem
x86: Fix build warning in arch/x86/mm/mmio-mod.c
x86: Remove usedac in feature-removal-schedule.txt
x86: Fix duplicated UV BAU interrupt vector
nvram: Fix write beyond end condition; prove to gcc copy is safe
mm: Adjust do_pages_stat() so gcc can see copy_from_user() is safe
x86: Limit the number of processor bootup messages
x86: Remove enabling x2apic message for every CPU
doc: Add documentation for bootloader_{type,version}
x86, msr: Add support for non-contiguous cpumasks
x86: Use find_e820() instead of hard coded trampoline address
x86, AMD: Fix stale cpuid4_info shared_map data in shared_cpu_map cpumasks
Trivial percpu-naming-introduced conflicts in arch/x86/kernel/cpu/intel_cacheinfo.c
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf sched: Fix build failure on sparc
perf bench: Add "all" pseudo subsystem and "all" pseudo suite
perf tools: Introduce perf_session class
perf symbols: Ditch dso->find_symbol
perf symbols: Allow lookups by symbol name too
perf symbols: Add missing "Variables" entry to map_type__name
perf symbols: Add support for 'variable' symtabs
perf symbols: Introduce ELF counterparts to symbol_type__is_a
perf symbols: Introduce symbol_type__is_a
perf symbols: Rename kthreads to kmaps, using another abstraction for it
perf tools: Allow building for ARM
hw-breakpoints: Handle bad modify_user_hw_breakpoint off-case return value
perf tools: Allow cross compiling
tracing, slab: Fix no callsite ifndef CONFIG_KMEMTRACE
tracing, slab: Define kmem_cache_alloc_notrace ifdef CONFIG_TRACING
Trivial conflict due to different fixes to modify_user_hw_breakpoint()
in include/linux/hw_breakpoint.h
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
m68k: rename global variable vmalloc_end to m68k_vmalloc_end
percpu: add missing per_cpu_ptr_to_phys() definition for UP
percpu: Fix kdump failure if booted with percpu_alloc=page
percpu: make misc percpu symbols unique
percpu: make percpu symbols in ia64 unique
percpu: make percpu symbols in powerpc unique
percpu: make percpu symbols in x86 unique
percpu: make percpu symbols in xen unique
percpu: make percpu symbols in cpufreq unique
percpu: make percpu symbols in oprofile unique
percpu: make percpu symbols in tracer unique
percpu: make percpu symbols under kernel/ and mm/ unique
percpu: remove some sparse warnings
percpu: make alloc_percpu() handle array types
vmalloc: fix use of non-existent percpu variable in put_cpu_var()
this_cpu: Use this_cpu_xx in trace_functions_graph.c
this_cpu: Use this_cpu_xx for ftrace
this_cpu: Use this_cpu_xx in nmi handling
this_cpu: Use this_cpu operations in RCU
this_cpu: Use this_cpu ops for VM statistics
...
Fix up trivial (famous last words) global per-cpu naming conflicts in
arch/x86/kvm/svm.c
mm/slab.c
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
ext3: Fix data / filesystem corruption when write fails to copy data
ext4: Support for 64-bit quota format
ext3: Support for vfsv1 quota format
quota: Implement quota format with 64-bit space and inode limits
quota: Move definition of QFMT_OCFS2 to linux/quota.h
ext2: fix comment in ext2_find_entry about return values
ext3: Unify log messages in ext3
ext2: clear uptodate flag on super block I/O error
ext2: Unify log messages in ext2
ext3: make "norecovery" an alias for "noload"
ext3: Don't update the superblock in ext3_statfs()
ext3: journal all modifications in ext3_xattr_set_handle
ext2: Explicitly assign values to on-disk enum of filetypes
quota: Fix WARN_ON in lookup_one_len
const: struct quota_format_ops
ubifs: remove manual O_SYNC handling
afs: remove manual O_SYNC handling
kill wait_on_page_writeback_range
vfs: Implement proper O_SYNC semantics
...
Slightly adjust the logic for determining the size of the
copy_form_user() in do_pages_stat(); with this change, gcc can see
that the copying is safe.
Without this, we get a build error for i386 allyesconfig:
/home/hpa/kernel/linux-2.6-tip.urgent/arch/x86/include/asm/uaccess_32.h:213:
error: call to ‘copy_from_user_overflow’ declared with attribute
error: copy_from_user() buffer size is not provably correct
Unlike an earlier patch from Arjan, this doesn't introduce new
variables; merely reshuffles the compare so that gcc can see that an
overflow cannot happen.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20090926205406.30d55b08@infradead.org>
New helper - sys_mmap_pgoff(); switch syscalls to using it.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Take the check for being able to expand vma in place into a separate
helper.
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Take the MREMAP_FIXED into a separate helper, simplify the living
hell out of conditions in both cases.
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Take locating vma and checks on it to a separate helper (it will be
shared between MREMAP_FIXED/non-MREMAP_FIXED cases when we split
them in the next patch)
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Define kmem_trace_alloc_{,node}_notrace() if CONFIG_TRACING is
enabled, otherwise perf-kmem will show wrong stats ifndef
CONFIG_KMEM_TRACE, because a kmalloc() memory allocation may
be traced by both trace_kmalloc() and trace_kmem_cache_alloc().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
LKML-Reference: <4B21F89A.7000801@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
cfq-iosched: Do not access cfqq after freeing it
block: include linux/err.h to use ERR_PTR
cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
blkio: Allow CFQ group IO scheduling even when CFQ is a module
blkio: Implement dynamic io controlling policy registration
blkio: Export some symbols from blkio as its user CFQ can be a module
block: Fix io_context leak after failure of clone with CLONE_IO
block: Fix io_context leak after clone with CLONE_IO
cfq-iosched: make nonrot check logic consistent
io controller: quick fix for blk-cgroup and modular CFQ
cfq-iosched: move IO controller declerations to a header file
cfq-iosched: fix compile problem with !CONFIG_CGROUP
blkio: Documentation
blkio: Wait on sync-noidle queue even if rq_noidle = 1
blkio: Implement group_isolation tunable
blkio: Determine async workload length based on total number of queues
blkio: Wait for cfq queue to get backlogged if group is empty
blkio: Propagate cgroup weight updation to cfq groups
blkio: Drop the reference to queue once the task changes cgroup
blkio: Provide some isolation between groups
...
In ____cache_alloc(), the variable 'ac' may be changed after
cache_alloc_refill() and the following kmemleak_erase() may get an incorrect
pointer. Update 'ac' after cache_alloc_refill() unconditionally.
See the following URL for the discussion of this patch:
http://marc.info/?l=linux-kernel&m=125873373124187&w=2
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
When the gotten object is NULL (probably due to ENOMEM), kmemleak_erase() is
unnecessary here, It just sets NULL to where already is NULL. Add a condition.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Branch profiling on my nehalem machine showed 99% incorrect branch hints:
28459 7678524 99 __cache_alloc_node slab.c 3551
Discussion on lkml [1] led to the solution to remove this hint.
[1] http://patchwork.kernel.org/patch/63517/
Signed-off-by: Tim Blechmann <tim@klingt.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
x86, Calgary IOMMU quirk: Find nearest matching Calgary while walking up the PCI tree
x86/amd-iommu: Remove amd_iommu_pd_table
x86/amd-iommu: Move reset_iommu_command_buffer out of locked code
x86/amd-iommu: Cleanup DTE flushing code
x86/amd-iommu: Introduce iommu_flush_device() function
x86/amd-iommu: Cleanup attach/detach_device code
x86/amd-iommu: Keep devices per domain in a list
x86/amd-iommu: Add device bind reference counting
x86/amd-iommu: Use dev->arch->iommu to store iommu related information
x86/amd-iommu: Remove support for domain sharing
x86/amd-iommu: Rearrange dma_ops related functions
x86/amd-iommu: Move some pte allocation functions in the right section
x86/amd-iommu: Remove iommu parameter from dma_ops_domain_alloc
x86/amd-iommu: Use get_device_id and check_device where appropriate
x86/amd-iommu: Move find_protection_domain to helper functions
x86/amd-iommu: Simplify get_device_resources()
x86/amd-iommu: Let domain_for_device handle aliases
x86/amd-iommu: Remove iommu specific handling from dma_ops path
x86/amd-iommu: Remove iommu parameter from __(un)map_single
x86/amd-iommu: Make alloc_new_range aware of multiple IOMMUs
...
invalidate_inode_pages2() returns -EBUSY *NOT* -EIO if any pages could not be
invalidated.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
- no one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more
- lumpy pageout will want to do nonblocking writeback without the
congestion wait
So remove the congestion checks as suggested by Chris.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Alex Elder <aelder@sgi.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
To touch task->flags directly is racy. thaw_process() still has race
(changing non_current->flags, but this is another issue) though, I think
it's much better off.
So, use thaw_process() instead.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: make device attrs static
Thaw refrigerated bdi flusher threads before invoking kthread_stop on them
As reported by Paul McKenney:
I am seeing some lockdep complaints in rcutorture runs that include
frequent CPU-hotplug operations. The tests are otherwise successful.
My first thought was to send a patch that gave each array_cache
structure's ->lock field its own struct lock_class_key, but you already
have a init_lock_keys() that seems to be intended to deal with this.
------------------------------------------------------------------------
=============================================
[ INFO: possible recursive locking detected ]
2.6.32-rc4-autokern1 #1
---------------------------------------------
syslogd/2908 is trying to acquire lock:
(&nc->lock){..-...}, at: [<c0000000001407f4>] .kmem_cache_free+0x118/0x2d4
but task is already holding lock:
(&nc->lock){..-...}, at: [<c0000000001411bc>] .kfree+0x1f0/0x324
other info that might help us debug this:
3 locks held by syslogd/2908:
#0: (&u->readlock){+.+.+.}, at: [<c0000000004556f8>] .unix_dgram_recvmsg+0x70/0x338
#1: (&nc->lock){..-...}, at: [<c0000000001411bc>] .kfree+0x1f0/0x324
#2: (&parent->list_lock){-.-...}, at: [<c000000000140f64>] .__drain_alien_cache+0x50/0xb8
stack backtrace:
Call Trace:
[c0000000e8ccafc0] [c0000000000101e4] .show_stack+0x70/0x184 (unreliable)
[c0000000e8ccb070] [c0000000000afebc] .validate_chain+0x6ec/0xf58
[c0000000e8ccb180] [c0000000000b0ff0] .__lock_acquire+0x8c8/0x974
[c0000000e8ccb280] [c0000000000b2290] .lock_acquire+0x140/0x18c
[c0000000e8ccb350] [c000000000468df0] ._spin_lock+0x48/0x70
[c0000000e8ccb3e0] [c0000000001407f4] .kmem_cache_free+0x118/0x2d4
[c0000000e8ccb4a0] [c000000000140b90] .free_block+0x130/0x1a8
[c0000000e8ccb540] [c000000000140f94] .__drain_alien_cache+0x80/0xb8
[c0000000e8ccb5e0] [c0000000001411e0] .kfree+0x214/0x324
[c0000000e8ccb6a0] [c0000000003ca860] .skb_release_data+0xe8/0x104
[c0000000e8ccb730] [c0000000003ca2ec] .__kfree_skb+0x20/0xd4
[c0000000e8ccb7b0] [c0000000003cf2c8] .skb_free_datagram+0x1c/0x5c
[c0000000e8ccb830] [c00000000045597c] .unix_dgram_recvmsg+0x2f4/0x338
[c0000000e8ccb920] [c0000000003c0f14] .sock_recvmsg+0xf4/0x13c
[c0000000e8ccbb30] [c0000000003c28ec] .SyS_recvfrom+0xb4/0x130
[c0000000e8ccbcb0] [c0000000003bfb78] .sys_recv+0x18/0x2c
[c0000000e8ccbd20] [c0000000003ed388] .compat_sys_recv+0x14/0x28
[c0000000e8ccbd90] [c0000000003ee1bc] .compat_sys_socketcall+0x178/0x220
[c0000000e8ccbe30] [c0000000000085d4] syscall_exit+0x0/0x40
This patch fixes the issue by setting up lockdep annotations during CPU
hotplug.
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
The unlikely() annotation in slab_alloc() covers too much of the expression.
It's actually very likely that the object is not NULL so use unlikely() only
for the __GFP_ZERO expression like SLAB does.
The patch reduces kernel text by 29 bytes on x86-64:
text data bss dec hex filename
24185 8560 176 32921 8099 mm/slub.o.orig
24156 8560 176 32892 807c mm/slub.o
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Commit 53d0422 ("tracing: Convert some kmem events to DEFINE_EVENT")
moved the kmem tracepoint creation from util.c to page_alloc.c,
but forgot to move the exports.
Move them back.
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
LKML-Reference: <4B0E286A.2000405@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
o kdump functionality reserves a per cpu area at boot time and exports the
physical address of that area to user space through sys interface. This
area stores some dump related information like cpu register states etc
at the time of crash.
o We were assuming that per cpu area always come from linearly mapped meory
region and using __pa() to determine physical address.
With percpu_alloc=page, per cpu area can come from vmalloc region also and
__pa() breaks.
o This patch implments a new function to convert per cpu address to
physical address.
Before the patch, crash_notes addresses looked as follows.
cpu0 60fffff49800
cpu1 60fffff60800
cpu2 60fffff77800
These are bogus phsyical addresses.
After the patch, address are following.
cpu0 13eb44000
cpu1 13eb43000
cpu2 13eb42000
cpu3 13eb41000
These look fine. I got 4G of memory and /proc/iomem tell me following.
100000000-13fffffff : System RAM
tj: * added missing asm/io.h include reported by Stephen Rothwell
* repositioned per_cpu_ptr_phys() in percpu.c and added comment.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Allow memory hotplug and hibernation in the same kernel
Memory hotplug and hibernation were exclusive in Kconfig. This is
obviously a problem for distribution kernels who want to support both in
the same image.
After some discussions with Rafael and others the only problem is with
parallel memory hotadd or removal while a hibernation operation is in
process. It was also working for s390 before.
This patch removes the Kconfig level exclusion, and simply makes the
memory add / remove functions grab the pm_mutex to exclude against
hibernation.
Fixes a regression - old kernels didn't exclude memory hotadd and
hibernation.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_MEMORY_HOTPLUG I got following warning:
WARNING: vmlinux.o(.text+0x1276b0): Section mismatch in reference from
the function hotadd_new_pgdat() to the function
.meminit.text:free_area_init_node()
The function hotadd_new_pgdat() references
the function __meminit free_area_init_node().
This is often because hotadd_new_pgdat lacks a __meminit
annotation or the annotation of free_area_init_node is wrong.
Use __ref to fix this.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: restructure pcpu_extend_area_map() to fix bugs and improve readability
pcpu_extend_area_map() had the following two bugs.
* It should return 1 if pcpu_lock was dropped and reacquired but it
returned 0. This could lead to oops if free_percpu() races with
area map extension.
* pcpu_mem_free() was called under pcpu_lock. pcpu_mem_free() might
end up calling vfree() which isn't IRQ safe. This could lead to
deadlock through lock order inversion via IRQ.
In addition, Linus pointed out that the temporary lock dropping and
subtle three-way return value of pcpu_extend_area_map() was very ugly
and suggested to split the function into two - pcpu_need_to_extend()
and pcpu_extend_area_map().
This patch restructures pcpu_extend_area_map() as suggested and fixes
the two bugs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Lee Schermerhorn reported that he saw bad pointer dereference in
mem_cgroup_end_migration() when he disabled memcg by boot option.
memcg's page migration logic works as
mem_cgroup_prepare_migration(page, &ptr);
do page migration
mem_cgroup_end_migration(page, ptr);
Now, ptr is not initialized in prepare_migration when memcg is disabled
by boot option. This causes panic in end_migration. This patch fixes it.
Reported-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 341ce06f69 ("page allocator:
calculate the alloc_flags for allocation only once") altered watermark
logic slightly by allowing rt_tasks that are handling an interrupt to set
ALLOC_HARDER. This patch brings the watermark logic more in line with
2.6.30.
This change results in a reduction of the number high-order GFP_ATOMIC
allocation failures reported. See
http://www.gossamer-threads.com/lists/linux/kernel/1144153
[rientjes@google.com: Spotted the problem]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as
well but currently, kswapd is not woken up after direct reclaim fails.
For order-0 allocations, this makes little difference but if there is a
heavy mix of higher-order allocations that direct reclaim is failing for,
it might mean that kswapd is not rewoken for higher orders as much as it
did previously.
This patch wakes up kswapd when an allocation is being retried after a
direct reclaim failure. It would be expected that kswapd is already
awake, but this has the effect of telling kswapd to reclaim at the higher
order as well.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unfreezes the bdi flusher task when the said task needs to exit.
Steps to reproduce this.
1) Mount a file system from MMC/SD card.
2) Unmount the file system. This creates a flusher task.
3) Attempt suspend to RAM. System is unresponsive.
This is because the bdi flusher thread is already in the refrigerator and will
remain so until it is thawed. The MMC driver suspend routine call stack will
ultimately issue a 'kthread_stop' on the bdi flusher thread and will block
until the flusher thread is exited. Since the bdi flusher thread is in the
refrigerator it never cleans up until thawed.
Signed-off-by: Romit Dasgupta <romit@ti.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
highmem: Fix debug_kmap_atomic() to also handle KM_IRQ_PTE, KM_NMI, and KM_NMI_PTE
highmem: Fix race in debug_kmap_atomic() which could cause warn_count to underflow
rcu: Fix long-grace-period race between forcing and initialization
uids: Prevent tear down race
Add a new function for freeing bootmem after the bootmem
allocator has been released and the unreserved pages given to
the page allocator.
This allows us to reserve bootmem and then release it if we
later discover it was not needed.
( This new API will be used by the swiotlb code to recover
a significant amount of RAM (64MB). )
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: chrisw@sous-sol.org
Cc: dwmw2@infradead.org
Cc: joerg.roedel@amd.com
Cc: muli@il.ibm.com
Cc: hannes@cmpxchg.org
Cc: tj@kernel.org
Cc: akpm@linux-foundation.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1257849980-22640-7-git-send-email-fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
debug_kmap_atomic() tries to prevent ever printing more than 10
warnings, but it does so by testing whether an unsigned integer
is equal to 0. However, if the warning is caused by a nested
IRQ, then this counter may underflow and the stream of warnings
will never end.
Fix that by using a signed integer instead.
Signed-off-by: Soeren Sandmann Pedersen <sandmann@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: a.p.zijlstra@chello.nl
Cc: <stable@kernel.org> # .31.x
LKML-Reference: <ye8zl7b8ktj.fsf@camel23.daimi.au.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
KSM needs a cond_resched() for CONFIG_PREEMPT_NONE, in its unbounded
search of the unstable tree. The stable tree cases already have one,
and originally there was one down inside get_user_pages();
but I missed it when I converted to follow_page() instead.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch was generated by
git grep -E -i -l '[Aa]quire' | xargs -r perl -p -i -e 's/([Aa])quire/$1cquire/'
and the cumsumed was found by checking the diff for aquire.
Signed-off-by: Uwe Kleine-Knig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cfq-iosched: limit coop preemption
cfq-iosched: fix bad return value cfq_should_preempt()
backing-dev: bdi sb prune should be in the unregister path, not destroy
Fix bio_alloc() and bio_kmalloc() documentation
bio_put(): add bio_clone() to the list of functions in the comment
Commit 592b09a42f was different from
the tested path, in that it moved the bdi super_block prune from
unregister to destroy context. This doesn't fully fix the sync hang
bug on unexpected device removal, as need to prune the bdi cache
pointer before killing flusher thread.
Tested-by: Artur Skawina <art.08.09@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In try_to_unuse(), swcount is a local copy of *swap_map, including the
SWAP_HAS_CACHE bit; but a wrong comparison against swap_count(*swap_map),
which masks off the SWAP_HAS_CACHE bit, succeeded where it should fail.
That had the effect of resetting the mm from which to start searching
for the next swap page, to an irrelevant mm instead of to an mm in which
this swap page had been found: which may increase search time by ~20%.
But we're used to swapoff being slow, so never noticed the slowdown.
Remove that one spurious use of swap_count(): Bo Liu thought it merely
redundant, Hugh rewrote the description since it was measurably wrong.
Signed-off-by: Bo Liu <bo-liu@hotmail.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't pass NULL pointers to fput() in the error handling paths of the NOMMU
do_mmap_pgoff() as it can't handle it.
The following can be used as a test program:
int main() { static long long a[1024 * 1024 * 20] = { 0 }; return a;}
Without the patch, the code oopses in atomic_long_dec_and_test() as called by
fput() after the kernel complains that it can't allocate that big a chunk of
memory. With the patch, the kernel just complains about the allocation size
and then the program segfaults during execve() as execve() can't complete the
allocation of all the new ELF program segments.
Reported-by: Robin Getz <rgetz@blackfin.uclinux.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
sched: move rq_weight data array out of .percpu
percpu: allow pcpu_alloc() to be called with IRQs off
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
backing-dev: ensure that a removed bdi no longer has super_block referencing it
block: use after free bug in __blkdev_get
block: silently error unsupported empty barriers too
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/ppc64: Use preempt_schedule_irq instead of preempt_schedule
powerpc: Minor cleanup to lib/Kconfig.debug
powerpc: Minor cleanup to sound/ppc/Kconfig
powerpc: Minor cleanup to init/Kconfig
powerpc: Limit memory hotplug support to PPC64 Book-3S machines
powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
powerpc: Fix compile errors found by new ppc64e_defconfig
powerpc: Add a Book-3E 64-bit defconfig
powerpc/booke: Fix xmon single step on PowerPC Book-E
powerpc: Align vDSO base address
powerpc: Fix segment mapping in vdso32
powerpc/iseries: Remove compiler version dependent hack
powerpc/perf_events: Fix priority of MSR HV vs PR bits
powerpc/5200: Update defconfigs
drivers/serial/mpc52xx_uart.c: Use UPIO_MEM rather than SERIAL_IO_MEM
powerpc/boot/dts: drop obsolete 'fsl5200-clocking'
of: Remove nested function
mpc5200: support for the MAN mpc5200 based board mucmc52
mpc5200: support for the MAN mpc5200 based board uc101
* 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
HWPOISON: fix invalid page count in printk output
HWPOISON: Allow schedule_on_each_cpu() from keventd
HWPOISON: fix/proc/meminfo alignment
HWPOISON: fix oops on ksm pages
HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page
HWPOISON: return early on non-LRU pages
HWPOISON: Add brief hwpoison description to Documentation
HWPOISON: Clean up PR_MCE_KILL interface
There are some places where we do like:
pte = pte_map();
do {
(do break in some conditions)
} while (pte++, ...);
pte_unmap(pte - 1);
But if the loop breaks at the first loop, pte_unmap() unmaps invalid pte.
This patch is a fix for this problem.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, sparsemem is only available if EXPERIMENTAL is enabled.
However, it hasn't ever been marked experimental.
It's been about four years since sparsemem was merged, and we have
platforms which depend on it; allow architectures to decide whether
sparsemem should be the default memory model.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Isolators putting a page back to the LRU do not hold the page lock, and if
the page is mlocked, another thread might munlock it concurrently.
Expecting this, the putback code re-checks the evictability of a page when
it just moved it to the unevictable list in order to correct its decision.
The problem, however, is that ordering is not garuanteed between setting
PG_lru when moving the page to the list and checking PG_mlocked
afterwards:
#0: #1
spin_lock()
if (TestClearPageMlocked())
if (PageLRU())
move to evictable list
SetPageLRU()
spin_unlock()
if (!PageMlocked())
move to evictable list
The PageMlocked() check may get reordered before SetPageLRU() in #0,
resulting in #0 not moving the still mlocked page, and in #1 failing to
isolate and move the page as well. The page is now stranded on the
unevictable list.
The race condition is very unlikely. The consequence currently is one
page falling off the reclaim grid and eventually getting freed with
PG_unevictable set, which triggers a warning in the page allocator.
TestClearPageMlocked() in #1 already provides full memory barrier
semantics.
This patch adds an explicit full barrier to force ordering between
SetPageLRU() and PageMlocked() so that either one of the competitors
rescues the page.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If migrate_prep is failed, new variable is leaked. This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible to have !Anon but SwapBacked pages, and some apps could
create huge number of such pages with MAP_SHARED|MAP_ANONYMOUS. These
pages go into the ANON lru list, and hence shall not be protected: we only
care mapped executable files. Failing to do so may trigger OOM.
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert
commit 71de1ccbe1
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
AuthorDate: Mon Sep 21 17:01:31 2009 -0700
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Tue Sep 22 07:17:27 2009 -0700
mm: oom analysis: add buffer cache information to show_free_areas()
show_free_areas() is called during page allocation failures, and page
allocation failures can occur in any calling context.
But nr_blockdev_pages() takes VFS locks which should not be taken from
hard IRQ context (at least). The result is lockdep warnings (and
deadlockability) during page allocation failures.
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 8aa7e847d (Fix congestion_wait() sync/async vs read/write
confusion) replace WRITE with BLK_RW_ASYNC. Unfortunately, concurrent mm
development made the unchanged place accidentally.
This patch fixes it too.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory failure on a KSM page currently oopses on its NULL anon_vma in
page_lock_anon_vma(): that may not be much worse than the consequence of
ignoring it, but it is better to be consistent with how ZERO_PAGE and
hugetlb pages and other awkward cases are treated. Just skip it.
We could fix it for 2.6.32 at the KSM end, by putting a dummy anon_vma
pointer in there; but that would get harder next time, when KSM will put a
pointer to something else there (and I'm not currently planning to do any
work to open that up to memory_failure). So I would prefer this simple
PageKsm test, until the other exceptions are handled.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch updates percpu related symbols under kernel/ and mm/ such
that percpu symbols are unique and don't clash with local symbols.
This serves two purposes of decreasing the possibility of global
percpu symbol collision and allowing dropping per_cpu__ prefix from
percpu symbols.
* kernel/lockdep.c: s/lock_stats/cpu_lock_stats/
* kernel/sched.c: s/init_rq_rt/init_rt_rq_var/ (any better idea?)
s/sched_group_cpus/sched_groups/
* kernel/softirq.c: s/ksoftirqd/run_ksoftirqd/a
* kernel/softlockup.c: s/(*)_timestamp/softlockup_\1_ts/
s/watchdog_task/softlockup_watchdog/
s/timestamp/ts/ for local variables
* kernel/time/timer_stats: s/lookup_lock/tstats_lookup_lock/
* mm/slab.c: s/reap_work/slab_reap_work/
s/reap_node/slab_reap_node/
* mm/vmstat.c: local variable changed to avoid collision with vmstat_work
Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
which cause name clashes" patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: (slab/vmstat) Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Make the following changes to remove some sparse warnings.
* Make DEFINE_PER_CPU_SECTION() declare __pcpu_unique_* before
defining it.
* Annotate pcpu_extend_area_map() that it is entered with pcpu_lock
held, releases it and then reacquires it.
* Make percpu related macros use unique nested variable names.
* While at it, add pcpu prefix to __size_call[_return]() macros as
to-be-implemented sparse annotations will add percpu specific stuff
to these macros.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
vmalloc used non-existent percpu variable vmap_cpu_blocks instead of
the intended vmap_block_queue. This went unnoticed because
put_cpu_var() didn't evaluate the parameter. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@suse.de>
When the bdi is being removed, we have to ensure that no super_blocks
currently have that cached in sb->s_bdi. Normally this is ensured by
the sb having a longer life span than the bdi, but if the device is
suddenly yanked, we have to kill this reference. sb->s_bdi is pointed
to freed memory at that point.
This fixes a problem with sync(1) hanging when a USB stick is pulled
without cleanly umounting it first.
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
pcpu_alloc() and pcpu_extend_area_map() perform a series of
spin_lock_irq()/spin_unlock_irq() calls, which make them unsafe
with respect to being called from contexts which have IRQs off.
This patch converts the code to perform save/restore of flags instead,
making pcpu_alloc() (or __alloc_percpu() respectively) to be called
from early kernel startup stage, where IRQs are off.
This is needed for proper initialization of per-cpu rq_weight data from
sched_init().
tj: added comment explaining why irqsave/restore is used in alloc path.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Tejun Heo <tj@kernel.org>
Based on discussions on LKML and LSM, where there are consecutive
security_ and ima_ calls in the vfs layer, move the ima_ calls to
the existing security_ hooks.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
The madvise injector already holds a reference when passing in a page
to the memory-failure code. The code corrects for this additional reference
for its checks, but the final printk output didn't. Fix that.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Memory failure on a KSM page currently oopses on its NULL anon_vma in
page_lock_anon_vma(): that may not be much worse than the consequence
of ignoring it, but it is better to be consistent with how ZERO_PAGE
and hugetlb pages and other awkward cases are treated. Just skip it.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
When returning due to a poisoned page drop the page count.
It wasn't a fatal problem because noone cares about the page count
on a poisoned page (except when it wraps), but it's cleaner to fix it.
Pointed out by Linus.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Right now we have some trouble with non atomic access
to page flags when locking the page. To plug this hole
for now, limit error recovery to LRU pages for now.
This could be better fixed by defining a suitable protocol,
but let's go this simple way for now
This avoids unnecessary races with __set_page_locked() and
__SetPageSlab*() and maybe more non-atomic page flag operations.
This loses isolated pages which are currently in page reclaim, but these
are relatively limited compared to the total memory.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[AK: new description, bug fixes, cleanups]
When collecting slub stats for particular workloads, it's necessary to
collect each statistic for all caches before the job is even started
because the counters are usually greater than zero just from boot and
initialization.
This allows a statistic to be cleared on each cpu by writing '0' to its
sysfs file. This creates a baseline for statistics of interest before
the workload is started.
Setting a statistic to a particular value is not supported, so all values
written to these files other than '0' returns -EINVAL.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>