Commit Graph

5298 Commits

Author SHA1 Message Date
Christoph Lameter
a71ae47a2c slub: Fix double bit unlock in debug mode
Commit 442b06bcea ("slub: Remove node check in slab_free") added a
call to deactivate_slab() in the debug case in __slab_alloc(), which
unlocks the current slab used for allocation.  Going to the label
'unlock_out' then does it again.

Also, in the debug case we do not need all the other processing that the
'unlock_out' path does.  We always fall back to the slow path in the
debug case.  So the tid update is useless.

Similarly, ALLOC_SLOWPATH would just be incremented for all allocations.
Also a pretty useless thing.

So simply restore irq flags and return the object.

Signed-off-by: Christoph Lameter <cl@linux.com>
Reported-and-bisected-by: James Morris <jmorris@namei.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Jens Axboe <jaxboe@fusionio.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:38:24 -07:00
Linus Torvalds
0d66cba1ac Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (29 commits)
  [S390] cpu hotplug: fix external interrupt subclass mask handling
  [S390] oprofile: dont access lowcore
  [S390] oprofile: add missing irq stats counter
  [S390] Ignore sendmmsg system call note wired up warning
  [S390] s390,oprofile: fix compile error for !CONFIG_SMP
  [S390] s390,oprofile: fix alert counter increment
  [S390] Remove unused includes in process.c
  [S390] get CPC image name
  [S390] sclp: event buffer dissection
  [S390] chsc: process channel-path-availability information
  [S390] refactor page table functions for better pgste support
  [S390] merge page_test_dirty and page_clear_dirty
  [S390] qdio: prevent compile warning
  [S390] sclp: remove unnecessary sendmask check
  [S390] convert old cpumask API into new one
  [S390] pfault: cleanup code
  [S390] pfault: cpu hotplug vs missing completion interrupts
  [S390] smp: add __noreturn attribute to cpu_die()
  [S390] percpu: implement arch specific irqsafe_cpu_ops
  [S390] vdso: disable gcov profiling
  ...
2011-05-24 12:06:02 -07:00
Linus Torvalds
5129df03d0 Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: Unify input section names
  percpu: Avoid extra NOP in percpu_cmpxchg16b_double
  percpu: Cast away printk format warning
  percpu: Always align percpu output section to PAGE_SIZE

Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun
2011-05-24 11:53:42 -07:00
Tejun Heo
6988f20fe0 Merge branch 'fixes-2.6.39' into for-2.6.40 2011-05-24 09:59:36 +02:00
Linus Torvalds
4867faab1e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Deal with hyperthetical case of PAGE_SIZE > 2M
  slub: Remove node check in slab_free
  slub: avoid label inside conditional
  slub: Make CONFIG_DEBUG_PAGE_ALLOC work with new fastpath
  slub: Avoid warning for !CONFIG_SLUB_DEBUG
  slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery
  slub: Move debug handlign in __slab_free
  slub: Move node determination out of hotpath
  slub: Eliminate repeated use of c->page through a new page variable
  slub: get_map() function to establish map of free objects in a slab
  slub: Use NUMA_NO_NODE in get_partial
  slub: Fix a typo in config name
2011-05-23 10:10:44 -07:00
Pekka Enberg
bfb91fb650 Merge branch 'slab/next' into for-linus
Conflicts:
	mm/slub.c
2011-05-23 19:50:39 +03:00
Linus Torvalds
57d19e80f4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  b43: fix comment typo reqest -> request
  Haavard Skinnemoen has left Atmel
  cris: typo in mach-fs Makefile
  Kconfig: fix copy/paste-ism for dell-wmi-aio driver
  doc: timers-howto: fix a typo ("unsgined")
  perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
  md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
  treewide: fix a few typos in comments
  regulator: change debug statement be consistent with the style of the rest
  Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
  audit: acquire creds selectively to reduce atomic op overhead
  rtlwifi: don't touch with treewide double semicolon removal
  treewide: cleanup continuations and remove logging message whitespace
  ath9k_hw: don't touch with treewide double semicolon removal
  include/linux/leds-regulator.h: fix syntax in example code
  tty: fix typo in descripton of tty_termios_encode_baud_rate
  xtensa: remove obsolete BKL kernel option from defconfig
  m68k: fix comment typo 'occcured'
  arch:Kconfig.locks Remove unused config option.
  treewide: remove extra semicolons
  ...
2011-05-23 09:12:26 -07:00
Martin Schwidefsky
2d42552d1c [S390] merge page_test_dirty and page_clear_dirty
The page_clear_dirty primitive always sets the default storage key
which resets the access control bits and the fetch protection bit.
That will surprise a KVM guest that sets non-zero access control
bits or the fetch protection bit. Merge page_test_dirty and
page_clear_dirty back to a single function and only clear the
dirty bit from the storage key.

In addition move the function page_test_and_clear_dirty and
page_test_and_clear_young to page.h where they belong. This
requires to change the parameter from a struct page * to a page
frame number.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-05-23 10:24:31 +02:00
Christoph Lameter
442b06bcea slub: Remove node check in slab_free
We can set the page pointing in the percpu structure to
NULL to have the same effect as setting c->node to NUMA_NO_NODE.

Gets rid of one check in slab_free() that was only used for
forcing the slab_free to the slowpath for debugging.

We still need to set c->node to NUMA_NO_NODE to force the
slab_alloc() fastpath to the slowpath in case of debugging.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-05-21 12:53:53 +03:00
Hugh Dickins
e6c9366b2a tmpfs: fix highmem swapoff crash regression
Commit 778dd893ae ("tmpfs: fix race between umount and swapoff")
forgot the new rules for strict atomic kmap nesting, causing

  WARNING: at arch/x86/mm/highmem_32.c:81

from __kunmap_atomic(), then

  BUG: unable to handle kernel paging request at fffb9000

from shmem_swp_set() when shmem_unuse_inode() is handling swapoff with
highmem in use.  My disgrace again.

See
  https://bugzilla.kernel.org/show_bug.cgi?id=35352

Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20 16:23:19 -07:00
Jeremy Fitzhardinge
ef691947d8 vmalloc: remove vmalloc_sync_all() from alloc_vm_area()
There's no need for it: it will get faulted into the current pagetable
as needed.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:14:32 -07:00
Linus Torvalds
268bb0ce3e sanitize <linux/prefetch.h> usage
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.

So this fixes things up a bit, using

   grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
   grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.

There are more of them around (mostly network drivers), but this gets
many core ones.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20 12:50:29 -07:00
Gustavo F. Padovan
345227d705 backing-dev: Kill set but not used var in bdi_debug_stats_show()
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 21:23:37 +02:00
Linus Torvalds
83d7e94875 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm:
  kmemleak: Initialise kmemleak after debug_objects_mem_init()
  kmemleak: Select DEBUG_FS unconditionally in DEBUG_KMEMLEAK
  kmemleak: Do not return a pointer to an object that kmemleak did not get
2011-05-19 16:44:13 -07:00
Catalin Marinas
52c3ce4ec5 kmemleak: Do not return a pointer to an object that kmemleak did not get
The kmemleak_seq_next() function tries to get an object (and increment
its use count) before returning it. If it could not get the last object
during list traversal (because it may have been freed), the function
should return NULL rather than a pointer to such object that it did not
get.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Acked-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Cc: <stable@kernel.org>
2011-05-19 17:35:28 +01:00
KAMEZAWA Hiroyuki
d6c438b6cd memcg: fix zone congestion
ZONE_CONGESTED should be a state of global memory reclaim.  If not, a busy
memcg sets this and give unnecessary throttoling in wait_iff_congested()
against memory recalim in other contexts.  This makes system performance
bad.

I'll think about "memcg is congested!" flag is required or not, later.
But this fix is required first.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Ying Han <yinghan@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-18 02:55:23 -07:00
David Rientjes
bd07d87fd4 slub: avoid label inside conditional
Jumping to a label inside a conditional is considered poor style,
especially considering the current organization of __slab_alloc().

This removes the 'load_from_page' label and just duplicates the three
lines of code that it uses:

	c->node = page_to_nid(page);
	c->page = page;
	goto load_freelist;

since it's probably not worth making this a separate helper function.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-05-17 22:19:00 +03:00
Christoph Lameter
1393d9a185 slub: Make CONFIG_DEBUG_PAGE_ALLOC work with new fastpath
Fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGE_ALLOC may have
marked as invalid to retrieve the pointer to the next free object.

Use probe_kernel_read in that case in order not to cause a page fault.

Cc: <stable@kernel.org> # 38.x
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-05-17 22:18:55 +03:00
Christoph Lameter
6332aa9d25 slub: Avoid warning for !CONFIG_SLUB_DEBUG
Move the #ifdef so that get_map is only defined if CONFIG_SLUB_DEBUG is defined.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-05-17 22:16:08 +03:00
Randy Dunlap
b5e6ab589d mm: fix kernel-doc warning in page_alloc.c
Fix new kernel-doc warning in mm/page_alloc.c:

  Warning(mm/page_alloc.c:2370): No description found for parameter 'nid'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-16 18:34:30 -07:00
Hugh Dickins
05bf86b4cc tmpfs: fix race between swapoff and writepage
Shame on me!  Commit b1dea800ac "tmpfs: fix race between umount and
writepage" fixed the advertized race, but introduced another: as even
its comment makes clear, we cannot safely rely on a peek at list_empty()
while holding no lock - until info->swapped is set, shmem_unuse_inode()
may delete any formerly-swapped inode from the shmem_swaplist, which
in this case would leave a swap area impossible to swapoff.

Although I don't relish taking the mutex every time, I don't care much
for the alternatives either; and at least the peek at list_empty() in
shmem_evict_inode() (a hotter path since most inodes would never have
been swapped) remains safe, because we already truncated the whole file.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-14 12:18:55 -07:00
Hugh Dickins
59a16ead57 tmpfs: fix spurious ENOSPC when racing with unswap
Testing the shmem_swaplist replacements for igrab() revealed another bug:
writes to /dev/loop0 on a tmpfs file which fills its filesystem were
sometimes failing with "Buffer I/O error"s.

These came from ENOSPC failures of shmem_getpage(), when racing with
swapoff: the same could happen when racing with another shmem_getpage(),
pulling the page in from swap in between our find_lock_page() and our
taking the info->lock (though not in the single-threaded loop case).

This is unacceptable, and surprising that I've not noticed it before:
it dates back many years, but (presumably) was made a lot easier to
reproduce in 2.6.36, which sited a page preallocation in the race window.

Fix it by rechecking the page cache before settling on an ENOSPC error.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11 18:50:45 -07:00
Hugh Dickins
778dd893ae tmpfs: fix race between umount and swapoff
The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable
to umount as that in shmem_writepage().

Fix this instance by extending the protection of shmem_swaplist_mutex
right across shmem_unuse_inode(): while it's on the list, the inode cannot
be evicted (and the filesystem cannot be unmounted) without
shmem_evict_inode() taking that mutex to remove it from the list.

But since shmem_writepage() might take that mutex, we should avoid making
memory allocations or memcg charges while holding it: prepare them at the
outer level in shmem_unuse().  When mem_cgroup_cache_charge() was
originally placed, we didn't know until that point that the page from swap
was actually a shmem page; but nowadays it's noted in the swap_map, so
we're safe to charge upfront.  For the radix_tree, do as is done in
shmem_getpage(): preload upfront, but don't pin to the cpu; so we make a
habit of refreshing the node pool, but might dip into GFP_NOWAIT reserves
on occasion if subsequently preempted.

With the allocation and charge moved out from shmem_unuse_inode(),
we can also hold index map and info->lock over from finding the entry.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11 18:50:45 -07:00
Hugh Dickins
b1dea800ac tmpfs: fix race between umount and writepage
Konstanin Khlebnikov reports that a dangerous race between umount and
shmem_writepage can be reproduced by this script:

  for i in {1..300} ; do
	mkdir $i
	while true ; do
		mount -t tmpfs none $i
		dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100))
		umount $i
	done &
  done

on a 6xCPU node with 8Gb RAM: kernel very unstable after this accident. =)

Kernel log:

  VFS: Busy inodes after unmount of tmpfs.
                 Self-destruct in 5 seconds.  Have a nice day...

  WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
  list_del corruption. prev->next should be ffff880222fdaac8, but was (null)
  Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
  Call Trace:
   warn_slowpath_common+0x80/0x98
   warn_slowpath_fmt+0x41/0x43
   __list_del_entry+0x8d/0x98
   evict+0x50/0x113
   iput+0x138/0x141
  ...
  BUG: unable to handle kernel paging request at ffffffffffffffff
  IP: shmem_free_blocks+0x18/0x4c
  Pid: 10422, comm: dd Tainted: G        W   2.6.39-rc2+ #4
  Call Trace:
   shmem_recalc_inode+0x61/0x66
   shmem_writepage+0xba/0x1dc
   pageout+0x13c/0x24c
   shrink_page_list+0x28e/0x4be
   shrink_inactive_list+0x21f/0x382
  ...

shmem_writepage() calls igrab() on the inode for the page which came from
page reclaim, to add it later into shmem_swaplist for swapoff operation.

This igrab() can race with super-block deactivating process:

  shrink_inactive_list()          deactivate_super()
  pageout()                       tmpfs_fs_type->kill_sb()
  shmem_writepage()               kill_litter_super()
                                  generic_shutdown_super()
                                   evict_inodes()
   igrab()
                                    atomic_read(&inode->i_count)
                                     skip-inode
   iput()
                                   if (!list_empty(&sb->s_inodes))
                                          printk("VFS: Busy inodes after...

This igrap-iput pair was added in commit 1b1b32f2c6 "tmpfs: fix
shmem_swaplist races" based on incorrect assumptions: igrab() protects the
inode from concurrent eviction by deletion, but it does nothing to protect
it from concurrent unmounting, which goes ahead despite the raised
i_count.

So this use of igrab() was wrong all along, but the race made much worse
in 2.6.37 when commit 63997e98a3 "split invalidate_inodes()" replaced
two attempts at invalidate_inodes() by a single evict_inodes().

Konstantin posted a plausible patch, raising sb->s_active too: I'm unsure
whether it was correct or not; but burnt once by igrab(), I am sure that
we don't want to rely more deeply upon externals here.

Fix it by adding the inode to shmem_swaplist earlier, while the page lock
on page in page cache still secures the inode against eviction, without
artifically raising i_count.  It was originally added later because
shmem_unuse_inode() is liable to remove an inode from the list while it's
unswapped; but we can guard against that by taking spinlock before
dropping mutex.

Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11 18:50:45 -07:00
Andi Kleen
21a3c96468 memcg: allocate memory cgroup structures in local nodes
Commit dde79e005a ("page_cgroup: reduce allocation overhead for
page_cgroup array for CONFIG_SPARSEMEM") added a regression that the
memory cgroup data structures all end up in node 0 because the first
attempt at allocating them would not pass in a node hint.  Since the
initialization runs on CPU #0 it would all end up node 0.  This is a
problem on large memory systems, where node 0 would lose a lot of
memory.

Change the alloc_pages_exact() to alloc_pages_exact_nid().  This will
still fall back to other nodes if not enough memory is available.

 [ RED-PEN: right now it would fall back first before trying
   vmalloc_node.  Probably not the best strategy ...  But I left it like
   that for now. ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reported-by: Doug Nelson
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11 18:50:45 -07:00
Andi Kleen
ee85c2e145 mm: add alloc_pages_exact_nid()
Add a alloc_pages_exact_nid() that allocates on a specific node.

The naming is quite broken, but fixing that would need a larger renaming
action.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11 18:50:45 -07:00
Yinghai Lu
8f389a99b6 mm: use alloc_bootmem_node_nopanic() on really needed path
Stefan found nobootmem does not work on his system that has only 8M of
RAM.  This causes an early panic:

  BIOS-provided physical RAM map:
   BIOS-88: 0000000000000000 - 000000000009f000 (usable)
   BIOS-88: 0000000000100000 - 0000000000840000 (usable)
  bootconsole [earlyser0] enabled
  Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
  DMI not present or invalid.
  last_pfn = 0x840 max_arch_pfn = 0x100000
  init_memory_mapping: 0000000000000000-0000000000840000
  8MB LOWMEM available.
    mapped low ram: 0 - 00840000
    low ram: 0 - 00840000
  Zone PFN ranges:
    DMA      0x00000001 -> 0x00001000
    Normal   empty
  Movable zone start PFN for each node
  early_node_map[2] active PFN ranges
      0: 0x00000001 -> 0x0000009f
      0: 0x00000100 -> 0x00000840
  BUG: Int 6: CR2 (null)
       EDI c034663c  ESI (null)  EBP c0329f38  ESP c0329ef4
       EBX c0346380  EDX 00000006  ECX ffffffff  EAX fffffff4
       err (null)  EIP c0353191   CS c0320060  flg 00010082
  Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
         00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
         (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
  Pid: 0, comm: swapper Not tainted 2.6.36 #5
  Call Trace:
   [<c02e3707>] ? 0xc02e3707
   [<c035e6e5>] 0xc035e6e5
   [<c0353191>] ? 0xc0353191
   [<c03534d6>] 0xc03534d6
   [<c034f1cd>] 0xc034f1cd
   [<c034a824>] 0xc034a824
   [<c03513cb>] ? 0xc03513cb
   [<c0349432>] 0xc0349432
   [<c0349066>] 0xc0349066

It turns out that we should ignore the low limit of 16M.

Use alloc_bootmem_node_nopanic() in this case.

[akpm@linux-foundation.org: less mess]
Signed-off-by: Yinghai LU <yinghai@kernel.org>
Reported-by: Stefan Hellermann <stefan@the2masters.de>
Tested-by: Stefan Hellermann <stefan@the2masters.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>		[2.6.34+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11 18:50:44 -07:00
Minchan Kim
bad49d9c89 mm: check PageUnevictable in lru_deactivate_fn()
The lru_deactivate_fn should not move page which in on unevictable lru
into inactive list.  Otherwise, we can meet BUG when we use
isolate_lru_pages as __isolate_lru_page could return -EINVAL.

Reported-by: Ying Han <yinghan@google.com>
Tested-by: Ying Han <yinghan@google.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11 18:50:44 -07:00
Hugh Dickins
42c36f63ac vm: fix vm_pgoff wrap in upward expansion
Commit a626ca6a65 ("vm: fix vm_pgoff wrap in stack expansion") fixed
the case of an expanding mapping causing vm_pgoff wrapping when you had
downward stack expansion.  But there was another case where IA64 and
PA-RISC expand mappings: upward expansion.

This fixes that case too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 17:52:17 -07:00
Mikulas Patocka
a09a79f668 Don't lock guardpage if the stack is growing up
Linux kernel excludes guard page when performing mlock on a VMA with
down-growing stack. However, some architectures have up-growing stack
and locking the guard page should be excluded in this case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
up-growing stack). lvm2 calculates number of used pages when locking and
when unlocking and reports an internal error if the numbers mismatch.

[ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
  grows-up case, and to move things around a bit to clean it all up and
  share the infrstructure with the /proc bits.

  Tested on ia64 that has both grow-up and grow-down segments  - Linus ]

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 16:22:07 -07:00
Christoph Lameter
1759415e63 slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery
Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used
everywhere.

There may be performance implications since:

A. We now have to manage a transaction ID for all arches

B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced
to a very short irqoff section.

There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback
case we only have to do one disable and enable like before.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-05-07 20:25:38 +03:00
Linus Torvalds
a1fde08c74 VM: skip the stack guard page lookup in get_user_pages only for mlock
The logic in __get_user_pages() used to skip the stack guard page lookup
whenever the caller wasn't interested in seeing what the actual page
was.  But Michel Lespinasse points out that there are cases where we
don't care about the physical page itself (so 'pages' may be NULL), but
do want to make sure a page is mapped into the virtual address space.

So using the existence of the "pages" array as an indication of whether
to look up the guard page or not isn't actually so great, and we really
should just use the FOLL_MLOCK bit.  But because that bit was only set
for the VM_LOCKED case (and not all vma's necessarily have it, even for
mlock()), we couldn't do that originally.

Fix that by moving the VM_LOCKED check deeper into the call-chain, which
actually simplifies many things.  Now mlock() gets simpler, and we can
also check for FOLL_MLOCK in __get_user_pages() and the code ends up
much more straightforward.

Reported-and-reviewed-by: Michel Lespinasse <walken@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-04 21:30:28 -07:00
Thomas Gleixner
30106b8ce2 slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg
The SLUB allocator use of the cmpxchg_double logic was wrong: it
actually needs the irq-safe one.

That happens automatically when we use the native unlocked 'cmpxchg8b'
instruction, but when compiling the kernel for older x86 CPUs that do
not support that instruction, we fall back to the generic emulation
code.

And if you don't specify that you want the irq-safe version, the generic
code ends up just open-coding the cmpxchg8b equivalent without any
protection against interrupts or preemption.  Which definitely doesn't
work for SLUB.

This was reported by Werner Landgraf <w.landgraf@ru.ru>, who saw
instability with his distro-kernel that was compiled to support pretty
much everything under the sun.  Most big Linux distributions tend to
compile for PPro and later, and would never have noticed this problem.

This also fixes the prototypes for the irqsafe cmpxchg_double functions
to use 'bool' like they should.

[ Btw, that whole "generic code defaults to no protection" design just
  sounds stupid - if the code needs no protection, there is no reason to
  use "cmpxchg_double" to begin with.  So we should probably just remove
  the unprotected version entirely as pointless.   - Linus ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-and-tested-by: werner <w.landgraf@ru.ru>
Acked-and-tested-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1105041539050.3005@ionos
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-04 14:20:20 -07:00
Mel Gorman
cc03638df2 mm: check if PTE is already allocated during page fault
With transparent hugepage support, handle_mm_fault() has to be careful
that a normal PMD has been established before handling a PTE fault.  To
achieve this, it used __pte_alloc() directly instead of pte_alloc_map as
pte_alloc_map is unsafe to run against a huge PMD.  pte_offset_map() is
called once it is known the PMD is safe.

pte_alloc_map() is smart enough to check if a PTE is already present
before calling __pte_alloc but this check was lost.  As a consequence,
PTEs may be allocated unnecessarily and the page table lock taken.  Thi
useless PTE does get cleaned up but it's a performance hit which is
visible in page_test from aim9.

This patch simply re-adds the check normally done by pte_alloc_map to
check if the PTE needs to be allocated before taking the page table lock.
The effect is noticable in page_test from aim9.

  AIM9
                  2.6.38-vanilla 2.6.38-checkptenone
  creat-clo      446.10 ( 0.00%)   424.47 (-5.10%)
  page_test       38.10 ( 0.00%)    42.04 ( 9.37%)
  brk_test        52.45 ( 0.00%)    51.57 (-1.71%)
  exec_test      382.00 ( 0.00%)   456.90 (16.39%)
  fork_test       60.11 ( 0.00%)    67.79 (11.34%)
  MMTests Statistics: duration
  Total Elapsed Time (seconds)                611.90    612.22

(While this affects 2.6.38, it is a performance rather than a functional
bug and normally outside the rules -stable.  While the big performance
differences are to a microbench, the difference in fork and exec
performance may be significant enough that -stable wants to consider the
patch)

Reported-by: Raz Ben Yehuda <raziebe@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28 11:28:21 -07:00
KOSAKI Motohiro
f755a042d8 oom: use pte pages in OOM score
PTE pages eat up memory just like anything else, but we do not account for
them in any way in the OOM scores.  They are also _guaranteed_ to get
freed up when a process is OOM killed, while RSS is not.

Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>		[2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28 11:28:21 -07:00
Andrea Arcangeli
78f11a2557 mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups
The huge_memory.c THP page fault was allowed to run if vm_ops was null
(which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
setup a special vma->vm_ops and it would fallback to regular anonymous
memory) but other THP logics weren't fully activated for vmas with vm_file
not NULL (/dev/zero has a not NULL vma->vm_file).

So this removes the vm_file checks so that /dev/zero also can safely use
THP (the other albeit safer approach to fix this bug would have been to
prevent the THP initial page fault to run if vm_file was set).

After removing the vm_file checks, this also makes huge_memory.c stricter
in khugepaged for the DEBUG_VM=y case.  It doesn't replace the vm_file
check with a is_pfn_mapping check (but it keeps checking for VM_PFNMAP
under VM_BUG_ON) because for a is_cow_mapping() mapping VM_PFNMAP should
only be allowed to exist before the first page fault, and in turn when
vma->anon_vma is null (so preventing khugepaged registration).  So I tend
to think the previous comment saying if vm_file was set, VM_PFNMAP might
have been set and we could still be registered in khugepaged (despite
anon_vma was not NULL to be registered in khugepaged) was too paranoid.
The is_linear_pfn_mapping check is also I think superfluous (as described
by comment) but under DEBUG_VM it is safe to stay.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Caspar Zhang <bugs@casparzhang.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28 11:28:20 -07:00
Jiri Kosina
07f9479a40 Merge branch 'master' into for-next
Fast-forwarded to current state of Linus' tree as there are patches to be
applied for files that didn't exist on the old branch.
2011-04-26 10:22:59 +02:00
Christoph Lameter
8dc16c6c04 slub: Move debug handlign in __slab_free
Its easier to read if its with the check for debugging flags.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-04-17 14:03:20 +03:00
Christoph Lameter
dc1fb7f436 slub: Move node determination out of hotpath
If the node does not change then there is no need to recalculate
the node from the page struct. So move the node determination
into the places where we acquire a new slab page.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-04-17 14:03:20 +03:00
Christoph Lameter
01ad8a7bc2 slub: Eliminate repeated use of c->page through a new page variable
__slab_alloc is full of "c->page" repeats. Lets just use one local variable
named "page" for this. Also avoids the need to a have another variable
called "new".

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-04-17 14:03:19 +03:00
Christoph Lameter
5f80b13ae4 slub: get_map() function to establish map of free objects in a slab
The bit map of free objects in a slab page is determined in various functions
if debugging is enabled.

Provide a common function for that purpose.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-04-17 14:03:19 +03:00
Christoph Lameter
33de04ec4c slub: Use NUMA_NO_NODE in get_partial
A -1 was leftover during the conversion.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-04-17 14:03:19 +03:00
Ben Hutchings
e27e6151b1 mm/thp: use conventional format for boolean attributes
The conventional format for boolean attributes in sysfs is numeric ("0" or
"1" followed by new-line).  Any boolean attribute can then be read and
written using a generic function.  Using the strings "yes [no]", "[yes]
no" (read), "yes" and "no" (write) will frustrate this.

[akpm@linux-foundation.org: use kstrtoul()]
[akpm@linux-foundation.org: test_bit() doesn't return 1/0, per Neil]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: <stable@kernel.org> 	[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:56 -07:00
KOSAKI Motohiro
341aea2bc4 oom-kill: remove boost_dying_task_prio()
This is an almost-revert of commit 93b43fa ("oom: give the dying task a
higher priority").

That commit dramatically improved oom killer logic when a fork-bomb
occurs.  But I've found that it has nasty corner case.  Now cpu cgroup has
strange default RT runtime.  It's 0!  That said, if a process under cpu
cgroup promote RT scheduling class, the process never run at all.

If an admin inserts a !RT process into a cpu cgroup by setting
rtruntime=0, usually it runs perfectly because a !RT task isn't affected
by the rtruntime knob.  But if it promotes an RT task via an explicit
setscheduler() syscall or an OOM, the task can't run at all.  In short,
the oom killer doesn't work at all if admins are using cpu cgroup and don't
touch the rtruntime knob.

Eventually, kernel may hang up when oom kill occur.  I and the original
author Luis agreed to disable this logic.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:56 -07:00
KOSAKI Motohiro
929bea7c71 vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
all_unreclaimable check in direct reclaim has been introduced at 2.6.19
by following commit.

	2006 Sep 25; commit 408d8544; oom: use unreclaimable info

And it went through strange history. firstly, following commit broke
the logic unintentionally.

	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations

Two years later, I've found obvious meaningless code fragment and
restored original intention by following commit.

	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0

But, the logic didn't works when 32bit highmem system goes hibernation
and Minchan slightly changed the algorithm and fixed it .

	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path

But, recently, Andrey Vagin found the new corner case. Look,

	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}

zone->all_unreclaimable and zone->pages_scanned are neigher atomic
variables nor protected by lock.  Therefore zones can become a state of
zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
all_unreclaimable() return false even though zone->all_unreclaimabe=1.

This resulted in the kernel hanging up when executing a loop of the form

1. fork
2. mmap
3. touch memory
4. read memory
5. munmmap

as described in
http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725

Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
zone and it become zone->all_unreclamble=1 easily.  and if it become
all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
all!

Eventually, oom-killer never works on such systems.  That said, we can't
use zone->pages_scanned for this purpose.  This patch restore
all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
to add oom_killer_disabled check to avoid reintroduce the issue of commit
d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:56 -07:00
Michael Ellerman
fe936dfc23 mm: check that we have the right vma in __access_remote_vm()
In __access_remote_vm() we need to check that we have found the right
vma, not the following vma before we try to access it.  Otherwise we
might call the vma's access routine with an address which does not fall
inside the vma.

It was discovered on a current kernel but with an unreleased driver,
from memory it was strace leading to a kernel bad access, but it
obviously depends on what the access implementation does.

Looking at other access implementations I only see:

  $ git grep -A 5 vm_operations|grep access
  arch/powerpc/platforms/cell/spufs/file.c-	.access = spufs_mem_mmap_access,
  arch/x86/pci/i386.c-	.access = generic_access_phys,
  drivers/char/mem.c-	.access = generic_access_phys
  fs/sysfs/bin.c-	.access		= bin_access,

The spufs one looks like it might behave badly given the wrong vma, it
assumes vma->vm_file->private_data is a spu_context, and looks like it
would probably blow up pretty quickly if it wasn't.

generic_access_phys() only uses the vma to check vm_flags and get the
mm, and then walks page tables using the address.  So it should bail on
the vm_flags check, or at worst let you access some other VM_IO mapping.

And bin_access() just proxies to another access implementation.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:55 -07:00
Jiri Kosina
4471a675df brk: COMPAT_BRK: fix detection of randomized brk
5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK")
tried to get the whole logic of brk randomization for legacy
(libc5-based) applications finally right.

It turns out that the way to detect whether brk has actually been
randomized in the end or not introduced by that patch still doesn't work
for those binaries, as reported by Geert:

: /sbin/init from my old m68k ramdisk exists prematurely.
:
: Before the patch:
:
: | brk(0x80005c8e)                         = 0x80006000
:
: After the patch:
:
: | brk(0x80005c8e)                         = 0x80005c8e
:
: Old libc5 considers brk() to have failed if the return value is not
: identical to the requested value.

I don't like it, but currently see no better option than a bit flag in
task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2
case.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:55 -07:00
Hugh Dickins
fc5da22ae3 tmpfs: fix off-by-one in max_blocks checks
If you fill up a tmpfs, df was showing

  tmpfs                   460800         -         -   -  /tmp

because of an off-by-one in the max_blocks checks.  Fix it so df shows

  tmpfs                   460800    460800         0 100% /tmp

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:55 -07:00
Andi Kleen
81ab4201fb mm: add VM counters for transparent hugepages
I found it difficult to make sense of transparent huge pages without
having any counters for its actions.  Add some counters to vmstat for
allocation of transparent hugepages and fallback to smaller pages.

Optional patch, but useful for development and understanding the system.

Contains improvements from Andrea Arcangeli and Johannes Weiner

[akpm@linux-foundation.org: coding-style fixes]
[hannes@cmpxchg.org: fix vmstat_text[] entries]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:55 -07:00
Christoph Lameter
d3bc236718 vmstat: update comment regarding stat_threshold
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:54 -07:00
Paul Mundt
9f6ae448bf mm/page_alloc.c: silence build_all_zonelists() section mismatch
The memory hotplug case involves calling to build_all_zonelists() which
in turns calls in to setup_zone_pageset().  The latter is marked
__meminit while build_all_zonelists() itself has no particular
annotation.  build_all_zonelists() is only handed a non-NULL pointer in
the case of memory hotplug through an existing __meminit path, so the
setup_zone_pageset() reference is always safe.

The options as such are either to flag build_all_zonelists() as __ref (as
per __build_all_zonelists()), or to simply discard the __meminit
annotation from setup_zone_pageset().

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:54 -07:00
Daniel Kiper
584208e6b4 mm: optimize pfn calculation in online_page()
If CONFIG_FLATMEM is enabled pfn is calculated in online_page() more than
once.  It is possible to optimize that and use value established at
beginning of that function.

Signed-off-by: Daniel Kiper <dkiper@net-space.pl>
Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:54 -07:00
Linus Torvalds
a626ca6a65 vm: fix vm_pgoff wrap in stack expansion
Commit 982134ba62 ("mm: avoid wrapping vm_pgoff in mremap()") fixed
the case of a expanding mapping causing vm_pgoff wrapping when you used
mremap.  But there was another case where we expand mappings hiding in
plain sight: the automatic stack expansion.

This fixes that case too.

This one also found by Robert Święcki, using his nasty system call
fuzzer tool.  Good job.

Reported-and-tested-by: Robert Święcki <robert@swiecki.net>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-13 08:07:28 -07:00
Linus Torvalds
95042f9eb7 vm: fix mlock() on stack guard page
Commit 53a7706d5e ("mlock: do not hold mmap_sem for extended periods
of time") changed mlock() to care about the exact number of pages that
__get_user_pages() had brought it.  Before, it would only care about
errors.

And that doesn't work, because we also handled one page specially in
__mlock_vma_pages_range(), namely the stack guard page.  So when that
case was handled, the number of pages that the function returned was off
by one.  In particular, it could be zero, and then the caller would end
up not making any progress at all.

Rather than try to fix up that off-by-one error for the mlock case
specially, this just moves the logic to handle the stack guard page
into__get_user_pages() itself, thus making all the counts come out
right automatically.

Reported-by: Robert Święcki <robert@swiecki.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-12 14:15:51 -07:00
Li Zefan
607bf324ab slub: Fix a typo in config name
There's no config named SLAB_DEBUG, and it should be a typo
of SLUB_DEBUG.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-04-12 22:27:27 +03:00
Justin P. Mattock
6eab04a876 treewide: remove extra semicolons
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-04-10 17:01:05 +02:00
Nikanth Karthikesan
58c2ee4007 mm: Fix section mismatch for setup_zone_pageset()
build_all_zonelists() which is not __meminit, calls setup_zone_pageset().

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-04-10 17:01:03 +02:00
Linus Torvalds
42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Linus Torvalds
982134ba62 mm: avoid wrapping vm_pgoff in mremap()
The normal mmap paths all avoid creating a mapping where the pgoff
inside the mapping could wrap around due to overflow.  However, an
expanding mremap() can take such a non-wrapping mapping and make it
bigger and cause a wrapping condition.

Noticed by Robert Swiecki when running a system call fuzzer, where it
caused a BUG_ON() due to terminally confusing the vma_prio_tree code.  A
vma dumping patch by Hugh then pinpointed the crazy wrapped case.

Reported-and-tested-by: Robert Swiecki <robert@swiecki.net>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-07 07:35:51 -07:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds
eefbab5995 Merge branch 'frv' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-frv
* 'frv' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-frv:
  FRV: Use generic show_interrupts()
  FRV: Convert genirq namespace
  frv: Select GENERIC_HARDIRQS_NO_DEPRECATED
  frv: Convert cpu irq_chip to new functions
  frv: Convert mb93493 irq_chip to new functions
  frv: Convert mb93093 irq_chip to new function
  frv: Convert mb93091 irq_chip to new functions
  frv: Fix typo from __do_IRQ overhaul
  frv: Remove stale irq_chip.end
  FRV: Do some cleanups
  FRV: Missing node arg in alloc_thread_info_node() macro
  NOMMU: implement access_remote_vm
  NOMMU: support SMP dynamic percpu_alloc
  NOMMU: percpu should use is_vmalloc_addr().
2011-03-29 11:43:30 -07:00
Mike Frysinger
f55f199b7d NOMMU: implement access_remote_vm
Recent vm changes brought in a new function which the core procfs code
utilizes.  So implement it for nommu systems too to avoid link failures.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Simon Horman <horms@verge.net.au>
Tested-by: Ithamar Adema <ithamar.adema@team-embedded.nl>
Acked-by: Greg Ungerer <gerg@uclinux.org>
2011-03-29 14:05:12 +01:00
Mike Frysinger
787e5b06a8 percpu: Cast away printk format warning
On 32-bit systems which don't happen to implicitly define or cast
VMALLOC_START and/or VMALLOC_END to long in their arch headers, the
printk in the percpu code will cause a warning to be emitted:

mm/percpu.c: In function 'pcpu_embed_first_chunk':
mm/percpu.c:1648: warning: format '%lx' expects type 'long unsigned int',
        but argument 3 has type 'unsigned int'

So add an explicit cast to unsigned long here.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-28 18:03:34 +02:00
David Howells
eac522ef43 NOMMU: percpu should use is_vmalloc_addr().
per_cpu_ptr_to_phys() uses VMALLOC_START and VMALLOC_END to determine if an
address is in the vmalloc() region or not.  This is incorrect on NOMMU as
there is no real vmalloc() capability (vmalloc() is emulated by kmalloc()).

The correct way to do this is to use is_vmalloc_addr().  This encapsulates the
vmalloc() region test in MMU mode and just returns 0 in NOMMU mode.

On FRV in NOMMU mode, the percpu compilation fails without this patch:

mm/percpu.c: In function 'per_cpu_ptr_to_phys':
mm/percpu.c:1011: error: 'VMALLOC_START' undeclared (first use in this function)
mm/percpu.c:1011: error: (Each undeclared identifier is reported only once
mm/percpu.c:1011: error: for each function it appears in.)
mm/percpu.c:1012: error: 'VMALLOC_END' undeclared (first use in this function)
mm/percpu.c:1018: warning: control reaches end of non-void function

Signed-off-by: David Howells <dhowells@redhat.com>
2011-03-28 12:53:29 +01:00
Randy Dunlap
ae91dbfc99 mm: fix memory.c incorrect kernel-doc
Fix mm/memory.c incorrect kernel-doc function notation:

  Warning(mm/memory.c:3718): Cannot understand  * @access_remote_vm - access another process' address space
   on line 3718 - I thought it was a doc line

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-27 19:30:18 -07:00
Linus Torvalds
d39dd11c3e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  fs: simplify iget & friends
  fs: pull inode->i_lock up out of writeback_single_inode
  fs: rename inode_lock to inode_hash_lock
  fs: move i_wb_list out from under inode_lock
  fs: move i_sb_list out from under inode_lock
  fs: remove inode_lock from iput_final and prune_icache
  fs: Lock the inode LRU list separately
  fs: factor inode disposal
  fs: protect inode->i_state with inode->i_lock
  autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
  autofs4 - remove autofs4_lock
  autofs4 - fix d_manage() return on rcu-walk
  autofs4 - fix autofs4_expire_indirect() traversal
  autofs4 - fix dentry leak in autofs4_expire_direct()
  autofs4 - reinstate last used update on access
  vfs - check non-mountpoint dentry might block in __follow_mount_rcu()
2011-03-24 19:01:30 -07:00
Dave Chinner
a66979abad fs: move i_wb_list out from under inode_lock
Protect the inode writeback list with a new global lock
inode_wb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24 21:17:51 -04:00
Dave Chinner
250df6ed27 fs: protect inode->i_state with inode->i_lock
Protect inode state transitions and validity checks with the
inode->i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.

This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.

Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode->i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24 21:16:31 -04:00
Linus Torvalds
a735140257 Merge branch 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  SLUB: Write to per cpu data when allocating it
  slub: Fix debugobjects with lockless fastpath
2011-03-24 17:51:12 -07:00
David Rientjes
b2b755b5f1 lib, arch: add filter argument to show_mem and fix private implementations
Commit ddd588b5dd ("oom: suppress nodes that are not allowed from
meminfo on oom kill") moved lib/show_mem.o out of lib/lib.a, which
resulted in build warnings on all architectures that implement their own
versions of show_mem():

	lib/lib.a(show_mem.o): In function `show_mem':
	show_mem.c:(.text+0x1f4): multiple definition of `show_mem'
	arch/sparc/mm/built-in.o:(.text+0xd70): first defined here

The fix is to remove __show_mem() and add its argument to show_mem() in
all implementations to prevent this breakage.

Architectures that implement their own show_mem() actually don't do
anything with the argument yet, but they could be made to filter nodes
that aren't allowed in the current context in the future just like the
generic implementation.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-24 17:49:37 -07:00
Christoph Lameter
b8c4c96ed4 SLUB: Write to per cpu data when allocating it
It turns out that the cmpxchg16b emulation has to access vmalloced
percpu memory with interrupts disabled. If the memory has never
been touched before then the fault necessary to establish the
mapping will not to occur and the kernel will fail on boot.

Fix that by reusing the CONFIG_PREEMPT code that writes the
cpu number into a field on every cpu. Writing to the per cpu
area before causes the mapping to be established before we get
to a cmpxchg16b emulation.

Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-24 21:53:07 +02:00
Thomas Gleixner
f9b615de46 slub: Fix debugobjects with lockless fastpath
On Thu, 24 Mar 2011, Ingo Molnar wrote:
> RIP: 0010:[<ffffffff810570a9>]  [<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260

That's a typical timer crash, but you were unable to debug it with
debugobjects because commit d3f661d6 broke those.

Cc: Christoph Lameter <cl@linux.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-24 21:26:46 +02:00
Tejun Heo
0415b00d17 percpu: Always align percpu output section to PAGE_SIZE
Percpu allocator honors alignment request upto PAGE_SIZE and both the
percpu addresses in the percpu address space and the translated kernel
addresses should be aligned accordingly.  The calculation of the
former depends on the alignment of percpu output section in the kernel
image.

The linker script macros PERCPU_VADDR() and PERCPU() are used to
define this output section and the latter takes @align parameter.
Several architectures are using @align smaller than PAGE_SIZE breaking
percpu memory alignment.

This patch removes @align parameter from PERCPU(), renames it to
PERCPU_SECTION() and makes it always align to PAGE_SIZE.  While at it,
add PCPU_SETUP_BUG_ON() checks such that alignment problems are
reliably detected and remove percpu alignment comment recently added
in workqueue.c as the condition would trigger BUG way before reaching
there.

For um, this patch raises the alignment of percpu area.  As the area
is in .init, there shouldn't be any noticeable difference.

This problem was discovered by David Howells while debugging boot
failure on mn10300.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: uclinux-dist-devel@blackfin.uclinux.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
2011-03-24 18:50:09 +01:00
Linus Torvalds
6c51038900 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
  Documentation/iostats.txt: bit-size reference etc.
  cfq-iosched: removing unnecessary think time checking
  cfq-iosched: Don't clear queue stats when preempt.
  blk-throttle: Reset group slice when limits are changed
  blk-cgroup: Only give unaccounted_time under debug
  cfq-iosched: Don't set active queue in preempt
  block: fix non-atomic access to genhd inflight structures
  block: attempt to merge with existing requests on plug flush
  block: NULL dereference on error path in __blkdev_get()
  cfq-iosched: Don't update group weights when on service tree
  fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
  block: Require subsystems to explicitly allocate bio_set integrity mempool
  jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  fs: make fsync_buffers_list() plug
  mm: make generic_writepages() use plugging
  blk-cgroup: Add unaccounted time to timeslice_used.
  block: fixup plugging stubs for !CONFIG_BLOCK
  block: remove obsolete comments for blkdev_issue_zeroout.
  blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
  ...

Fix up conflicts in fs/{aio.c,super.c}
2011-03-24 10:16:26 -07:00
Linus Torvalds
b81a618dcd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  deal with races in /proc/*/{syscall,stack,personality}
  proc: enable writing to /proc/pid/mem
  proc: make check_mem_permission() return an mm_struct on success
  proc: hold cred_guard_mutex in check_mem_permission()
  proc: disable mem_write after exec
  mm: implement access_remote_vm
  mm: factor out main logic of access_process_vm
  mm: use mm_struct to resolve gate vma's in __get_user_pages
  mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
  mm: arch: make in_gate_area take an mm_struct instead of a task_struct
  mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
  x86: mark associated mm when running a task in 32 bit compatibility mode
  x86: add context tag to mark mm when running a task in 32-bit compatibility mode
  auxv: require the target to be tracable (or yourself)
  close race in /proc/*/environ
  report errors in /proc/*/*map* sanely
  pagemap: close races with suid execve
  make sessionid permissions in /proc/*/task/* match those in /proc/*
  fix leaks in path_lookupat()

Fix up trivial conflicts in fs/proc/base.c
2011-03-23 20:51:42 -07:00
Olaf Hering
93a72052be crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn
The Xen PV drivers in a crashed HVM guest can not connect to the dom0
backend drivers because both frontend and backend drivers are still in
connected state.  To run the connection reset function only in case of a
crashdump, the is_kdump_kernel() function needs to be available for the PV
driver modules.

Consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn into
kernel/crash_dump.c Also export elfcorehdr_addr to make is_kdump_kernel()
usable for modules.

Leave 'elfcorehdr' as early_param().  This changes powerpc from __setup()
to early_param().  It adds an address range check from x86 also on ia64
and powerpc.

[akpm@linux-foundation.org: additional #includes]
[akpm@linux-foundation.org: remove elfcorehdr_addr export]
[akpm@linux-foundation.org: fix for Tejun's mm/nobootmem.c changes]
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:47:19 -07:00
David Rientjes
f9434ad155 memcg: give current access to memory reserves if it's trying to die
When a memcg is oom and current has already received a SIGKILL, then give
it access to memory reserves with a higher scheduling priority so that it
may quickly exit and free its memory.

This is identical to the global oom killer and is done even before
checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
selected is guaranteed to have come from userspace; the thread only needs
access to memory reserves to exit and thus we don't unnecessarily panic
the machine until the kernel has no last resort to free memory.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:33 -07:00
KAMEZAWA Hiroyuki
5a6475a4e1 memcg: fix leak on wrong LRU with FUSE
fs/fuse/dev.c::fuse_try_move_page() does

   (1) remove a page by ->steal()
   (2) re-add the page to page cache
   (3) link the page to LRU if it was not on LRU at (1)

This implies the page is _on_ LRU when it's added to radix-tree.  So, the
page is added to memory cgroup while it's on LRU.  because LRU is lazy and
no one flushs it.

This is the same behavior as SwapCache and needs special care as
 - remove page from LRU before overwrite pc->mem_cgroup.
 - add page to LRU after overwrite pc->mem_cgroup.

And we need to taking care of pagevec.

If PageLRU(page) is set before we add PCG_USED bit, the page will not be
added to memcg's LRU (in short period).  So, regardlress of PageLRU(page)
value before commit_charge(), we need to check PageLRU(page) after
commit_charge().

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Balbir Singh <balbir@in.ibm.com>
Reported-by: Daniel Poelzleithner <poelzi@poelzi.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:33 -07:00
Michal Hocko
6cfddb2615 memcg: page_cgroup array is never stored on reserved pages
KAMEZAWA Hiroyuki noted that free_pages_cgroup doesn't have to check for
PageReserved because we never store the array on reserved pages (neither
alloc_pages_exact nor vmalloc use those pages).

So we can replace the check by a BUG_ON.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:33 -07:00
Michal Hocko
dde79e005a page_cgroup: reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
Currently we are allocating a single page_cgroup array per memory section
(stored in mem_section->base) when CONFIG_SPARSEMEM is selected.  This is
correct but memory inefficient solution because the allocated memory
(unless we fall back to vmalloc) is not kmalloc friendly:

        - 32b - 16384 entries (20B per entry) fit into 327680B so the
          524288B slab cache is used
        - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
        - 64b - 32768 entries (40B per entry) fit into 2097152 cache

This is ~37% wasted space per memory section and it sumps up for the whole
memory.  On a x86_64 machine it is something like 6MB per 1GB of RAM.

We can reduce the internal fragmentation by using alloc_pages_exact which
allocates PAGE_SIZE aligned blocks so we will get down to <4kB wasted
memory per section which is much better.

We still need a fallback to vmalloc because we have no guarantees that we
will have a continuous memory of that size (order-10) later on during the
hotplug events.

[hannes@cmpxchg.org: do not define unused free_page_cgroup() without memory hotplug]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:32 -07:00
Andrew Morton
4be4489fea mm/memcontrol.c: suppress uninitialized-var warning with older gcc's
mm/memcontrol.c: In function 'mem_cgroup_force_empty':
mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function

It's a false positive.

Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:32 -07:00
Johannes Weiner
7a159cc9d7 memcg: use native word page statistics counters
The statistic counters are in units of pages, there is no reason to make
them 64-bit wide on 32-bit machines.

Make them native words.  Since they are signed, this leaves 31 bit on
32-bit machines, which can represent roughly 8TB assuming a page size of
4k.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:31 -07:00
Johannes Weiner
e9f8974f2f memcg: break out event counters from other stats
For increasing and decreasing per-cpu cgroup usage counters it makes sense
to use signed types, as single per-cpu values might go negative during
updates.  But this is not the case for only-ever-increasing event
counters.

All the counters have been signed 64-bit so far, which was enough to count
events even with the sign bit wasted.

This patch:
- divides s64 counters into signed usage counters and unsigned
  monotonically increasing event counters.
- converts unsigned event counters into 'unsigned long' rather than
  'u64'.  This matches the type used by the /proc/vmstat event counters.

The next patch narrows the signed usage counters type (on 32-bit CPUs,
that is).

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:31 -07:00
Johannes Weiner
7ec99d6213 memcg: unify charge/uncharge quantities to units of pages
There is no clear pattern when we pass a page count and when we pass a
byte count that is a multiple of PAGE_SIZE.

We never charge or uncharge subpage quantities, so convert it all to page
counts.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:30 -07:00
Johannes Weiner
7ffd4ca7a2 memcg: convert uncharge batching from bytes to page granularity
We never uncharge subpage quantities.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:30 -07:00
Johannes Weiner
11c9ea4e80 memcg: convert per-cpu stock from bytes to page granularity
We never keep subpage quantities in the per-cpu stock.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:29 -07:00
Johannes Weiner
e7018b8d27 memcg: keep only one charge cancelling function
We have two charge cancelling functions: one takes a page count, the other
a page size.  The second one just divides the parameter by PAGE_SIZE and
then calls the first one.  This is trivial, no need for an extra function.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:29 -07:00
Johannes Weiner
bf1ff2635a memcg: remove memcg->reclaim_param_lock
The reclaim_param_lock is only taken around single reads and writes to
integer variables and is thus superfluous.  Drop it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:29 -07:00
Johannes Weiner
4dc03de1b2 memcg: charged pages always have valid per-memcg zone info
page_cgroup_zoneinfo() will never return NULL for a charged page, remove
the check for it in mem_cgroup_get_reclaim_stat_from_page().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:28 -07:00
Johannes Weiner
6b3ae58efc memcg: remove direct page_cgroup-to-page pointer
In struct page_cgroup, we have a full word for flags but only a few are
reserved.  Use the remaining upper bits to encode, depending on
configuration, the node or the section, to enable page_cgroup-to-page
lookups without a direct pointer.

This saves a full word for every page in a system with memory cgroups
enabled.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:28 -07:00
Johannes Weiner
5564e88ba6 memcg: condense page_cgroup-to-page lookup points
The per-cgroup LRU lists string up 'struct page_cgroup's.  To get from
those structures to the page they represent, a lookup is required.
Currently, the lookup is done through a direct pointer in struct
page_cgroup, so a lot of functions down the callchain do this lookup by
themselves instead of receiving the page pointer from their callers.

The next patch removes this pointer, however, and the lookup is no longer
that straight-forward.  In preparation for that, this patch only leaves
the non-optional lookups when coming directly from the LRU list and passes
the page down the stack.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:27 -07:00
Johannes Weiner
de3638d9cd memcg: fold __mem_cgroup_move_account into caller
It is one logical function, no need to have it split up.

Also, get rid of some checks from the inner function that ensured the
sanity of the outer function.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:27 -07:00
Johannes Weiner
97a6c37b34 memcg: change page_cgroup_zoneinfo signature
Instead of passing a whole struct page_cgroup to this function, let it
take only what it really needs from it: the struct mem_cgroup and the
page.

This has the advantage that reading pc->mem_cgroup is now done at the same
place where the ordering rules for this pointer are enforced and
explained.

It is also in preparation for removing the pc->page backpointer.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:26 -07:00
Johannes Weiner
ad324e9447 memcg: no uncharged pages reach page_cgroup_zoneinfo
This patch series removes the direct page pointer from struct page_cgroup,
which saves 20% of per-page memcg memory overhead (Fedora and Ubuntu
enable memcg per default, openSUSE apparently too).

The node id or section number is encoded in the remaining free bits of
pc->flags which allows calculating the corresponding page without the
extra pointer.

I ran, what I think is, a worst-case microbenchmark that just cats a large
sparse file to /dev/null, because it means that walking the LRU list on
behalf of per-cgroup reclaim and looking up pages from page_cgroups is
happening constantly and at a high rate.  But it made no measurable
difference.  A profile reported a 0.11% share of the new
lookup_cgroup_page() function in this benchmark.

This patch:

All callsites check PCG_USED before passing pc->mem_cgroup, so the latter
is never NULL.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:26 -07:00
Daisuke Nishimura
f212ad7cf9 memcg: add memcg sanity checks at allocating and freeing pages
Add checks at allocating or freeing a page whether the page is used (iow,
charged) from the view point of memcg.

This check may be useful in debugging a problem and we did similar checks
before the commit 52d4b9ac(memcg: allocate all page_cgroup at boot).

This patch adds some overheads at allocating or freeing memory, so it's
enabled only when CONFIG_DEBUG_VM is enabled.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:25 -07:00
Johannes Weiner
af4a662144 memcg: remove NULL check from lookup_page_cgroup() result
The page_cgroup array is set up before even fork is initialized.  I
seriously doubt that this code executes before the array is alloc'd.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:25 -07:00
Johannes Weiner
c14f35c70e memcg: remove impossible conditional when committing
No callsite ever passes a NULL pointer for a struct mem_cgroup * to the
committing function.  There is no need to check for it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:24 -07:00
Johannes Weiner
3403968d7a memcg: remove unused page flag bitfield defines
These definitions have been unused since '4b3bde4 memcg: remove the
overhead associated with the root cgroup'.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:24 -07:00
Johannes Weiner
9d11ea9f16 memcg: simplify the way memory limits are checked
Since transparent huge pages, checking whether memory cgroups are below
their limits is no longer enough, but the actual amount of chargeable
space is important.

To not have more than one limit-checking interface, replace
memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a
single memory_cgroup_margin() that returns the chargeable space and leaves
the comparison to the callsite.

Soft limits are now checked the other way round, by using the already
existing function that returns the amount by which soft limits are
exceeded: res_counter_soft_limit_excess().

Also remove all the corresponding functions on the res_counter side that
are now no longer used.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:23 -07:00
Johannes Weiner
b7c6167848 memcg: soft limit reclaim should end at limit not below
Soft limit reclaim continues until the usage is below the current soft
limit, but the documented semantics are actually that soft limit reclaim
will push usage back until the soft limits are met again.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:23 -07:00
KAMEZAWA Hiroyuki
56039efa18 memcg: fix ugly initialization of return value is in caller
Remove initialization of vaiable in caller of memory cgroup function.
Actually, it's return value of memcg function but it's initialized in
caller.

Some memory cgroup uses following style to bring the result of start
function to the end function for avoiding races.

   mem_cgroup_start_A(&(*ptr))
   /* Something very complicated can happen here. */
   mem_cgroup_end_A(*ptr)

In some calls, *ptr should be initialized to NULL be caller.  But it's
ugly.  This patch fixes that *ptr is initialized by _start function.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:22 -07:00
Stephen Wilson
5ddd36b9c5 mm: implement access_remote_vm
Provide an alternative to access_process_vm that allows the caller to obtain a
reference to the supplied mm_struct.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:57 -04:00
Stephen Wilson
206cb63657 mm: factor out main logic of access_process_vm
Introduce an internal helper __access_remote_vm and base access_process_vm on
top of it.  This new method may be called with a NULL task_struct if page fault
accounting is not desired.  This code will be shared with a new address space
accessor that is independent of task_struct.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:56 -04:00
Stephen Wilson
e7f22e207b mm: use mm_struct to resolve gate vma's in __get_user_pages
We now check if a requested user page overlaps a gate vma using the supplied mm
instead of the supplied task.  The given task is now used solely for accounting
purposes and may be NULL.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:56 -04:00
Stephen Wilson
cae5d39032 mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
Now that gate vma's are referenced with respect to a particular mm and not a
particular task it only makes sense to propagate the change to this predicate as
well.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:55 -04:00
Stephen Wilson
83b964bbf8 mm: arch: make in_gate_area take an mm_struct instead of a task_struct
Morally, the question of whether an address lies in a gate vma should be asked
with respect to an mm, not a particular task.  Moreover, dropping the dependency
on task_struct will help make existing and future operations on mm's more
flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:54 -04:00
Stephen Wilson
31db58b3ab mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
Morally, the presence of a gate vma is more an attribute of a particular mm than
a particular task.  Moreover, dropping the dependency on task_struct will help
make both existing and future operations on mm's more flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:54 -04:00
Cesar Eduardo Barros
2130781e2a sys_swapon: fix inode locking
A conflict between 52c50567d8 ("mm: swap: unlock swapfile inode mutex
before closing file on bad swapfiles") and 83ef99befc ("sys_swapon:
remove did_down variable") caused a double unlock of the inode mutex
(once in bad_swap: before the filp_close, once at the end just before
returning).

The patch which added the extra unlock cleared did_down to avoid
unlocking twice, but the other patch removed the did_down variable.

To fix, set inode to NULL after the first unlock, since it will be used
after that point only for the final unlock.

While checking this patch, I found a path which could unlock without
locking, in case the same inode was added as a swapfile twice. To fix,
move the setting of the inode variable further down, to just before
claim_swapfile, which will lock the inode before doing anything else.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 07:54:22 -07:00
Shaohua Li
3dd7ae8ec0 mm: simplify code of swap.c
Clean up code and remove duplicate code. Next patch will use
pagevec_lru_move_fn introduced here too.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:09 -07:00
Hugh Dickins
bee4c36a5c shmem: let shared anonymous be nonlinear again
Up to 2.6.22, you could use remap_file_pages(2) on a tmpfs file or a
shared mapping of /dev/zero or a shared anonymous mapping.  In 2.6.23 we
disabled it by default, but set VM_CAN_NONLINEAR to enable it on safe
mappings.  We made sure to set it in shmem_mmap() for tmpfs files, but
missed it in shmem_zero_setup() for the others.  Fix that at last.

Reported-by: Kenny Simpson <theonetruekenny@yahoo.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:09 -07:00
Benjamin Herrenschmidt
8f7a66051b mm/memblock: properly handle overlaps and fix error path
Currently memblock_reserve() or memblock_free() don't handle overlaps of
any kind.  There is some special casing for coalescing exactly adjacent
regions but that's about it.

This is annoying because typically memblock_reserve() is used to mark
regions passed by the firmware as reserved and we all know how much we can
trust our firmwares...

Also, with the current code, if we do something it doesn't handle right
such as trying to memblock_reserve() a large range spanning multiple
existing smaller reserved regions for example, or doing overlapping
reservations, it can silently corrupt the internal region array, causing
odd errors much later on, such as allocations returning reserved regions
etc...

This patch rewrites the underlying functions that add or remove a region
to the arrays.  The new code is a lot more robust as it fully handles
overlapping regions.  It's also, imho, simpler than the previous
implementation.

In addition, while doing so, I found a bug where if we fail to double the
array while adding a region, we would remove the last region of the array
rather than the region we just allocated.  This fixes it too.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:09 -07:00
Kirill A. Shutemov
84be48d84a mm/page_alloc.c: use list_move() instead of list_del()/list_add() combination
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:09 -07:00
Namhyung Kim
a42931bf9c vmalloc: remove confusing comment on vwrite()
KM_USER1 is never used for vwrite() path so the caller doesn't need to
guarantee it is not used.  Only the caller should guarantee is KM_USER0
and it is commented already.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:09 -07:00
Jun'ichi Nomura
cf15b07cf4 writeback: make mapping->writeback_index to point to the last written page
For range-cyclic writeback (e.g.  kupdate), the writeback code sets a
continuation point of the next writeback to mapping->writeback_index which
is set the page after the last written page.  This happens so that we
evenly write the whole file even if pages in it get continuously
redirtied.

However, in some cases, sequential writer is writing in the middle of the
page and it just redirties the last written page by continuing from that.
For example with an application which uses a file as a big ring buffer we
see:

[1st writeback session]
       ...
       flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898514 + 8
       flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898522 + 8
       flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898530 + 8
       flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898538 + 8
       flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898546 + 8
     kworker/0:1-11    4571: block_rq_issue: 8,0 W 0 () 94898514 + 40
>>     flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898554 + 8
>>     flush-8:0-2743  4571: block_rq_issue: 8,0 W 0 () 94898554 + 8

[2nd writeback session after 35sec]
       flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898562 + 8
       flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898570 + 8
       flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898578 + 8
       ...
     kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94898562 + 640
     kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94899202 + 72
       ...
       flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899962 + 8
       flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899970 + 8
       flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899978 + 8
       flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899986 + 8
       flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899994 + 8
     kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94899962 + 40
>>     flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898554 + 8
>>     flush-8:0-2743  4606: block_rq_issue: 8,0 W 0 () 94898554 + 8

So we seeked back to 94898554 after we wrote all the pages at the end of
the file.

This extra seek seems unnecessary.  If we continue writeback from the last
written page, we can avoid it and do not cause harm to other cases.  The
original intent of even writeout over the whole file is preserved and if
the page does not get redirtied pagevec_lookup_tag() just skips it.

As an exceptional case, when I/O error happens, set done_index to the next
page as the comment in the code suggests.

Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:09 -07:00
Cesar Eduardo Barros
24b8ff7c27 mm: remove inline from scan_swap_map()
scan_swap_map() is a large function (224 lines), with several loops and a
complex control flow involving several gotos.

Given all that, it is a bit silly that it is marked as inline.  The
compiler agrees with me: on a x86-64 compile, it did not inline the
function.

Remove the "inline" and let the compiler decide instead.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:09 -07:00
Cesar Eduardo Barros
40531542e2 sys_swapon: separate final enabling of the swapfile
The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(). Move
this code to a separate function, and use it both in sys_swapon and
sys_swapoff.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:08 -07:00
Cesar Eduardo Barros
c6a2b64ba5 sys_swapoff: change order to match sys_swapon
The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(), except
for the order of the operations within the lock. Since the order should
not matter, arbitrarily change sys_swapoff to match sys_swapon, in
preparation to making both share the same code.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:08 -07:00
Cesar Eduardo Barros
c69dbfb84e sys_swapon: move printk outside lock
The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(). To be
able to make both share the same code, move the printk() call in the
middle of it to just after it.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:08 -07:00
Cesar Eduardo Barros
9c8100ef26 sys_swapon: remove nr_good_pages variable
It still exists within setup_swap_map_and_extents(), but after it
nr_good_pages == p->pages.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:08 -07:00
Cesar Eduardo Barros
bdb8e3f683 sys_swapon: simplify error flow in setup_swap_map_and_extents()
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:08 -07:00
Cesar Eduardo Barros
915d4d7bc0 sys_swapon: separate parsing of bad blocks and extents
Move the code which parses the bad block list and the extents to a
separate function. Only code movement, no functional changes.

This change uses the fact that, after the success path, nr_good_pages ==
p->pages.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:08 -07:00
Cesar Eduardo Barros
1421ef3cd1 sys_swapon: call swap_cgroup_swapon() earlier
The call to swap_cgroup_swapon is in the middle of loading the swap map
and extents. As it only does memory allocation and does not depend on
the swapfile layout (map/extents), it can be called earlier (or later).

Move it to just after the allocation of swap_map, since it is
conceptually similar (allocates a map).

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:08 -07:00
Cesar Eduardo Barros
3871902538 sys_swapon: simplify error flow in read_swap_header()
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:08 -07:00
Cesar Eduardo Barros
ca8bd38bf6 sys_swapon: separate parsing of swapfile header
Move the code which parses and checks the swapfile header (except for
the bad block list) to a separate function. Only code movement, no
functional changes.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
5de771e41f sys_swapon: move setting of swapfilepages near use
There is no reason I can see to read inode->i_size long before it is
needed. Move its read to just before it is needed, to reduce the
variable lifetime.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
87ade72a79 sys_swapon: simplify error flow in claim_swapfile()
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
4d0e1e1075 sys_swapon: separate bdev claim and inode lock
Move the code which claims the bdev (S_ISBLK) or locks the inode
(S_ISREG) to a separate function. Only code movement, no functional
changes.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
bd69010b04 sys_swapon: use a single error label
sys_swapon currently has two error labels, bad_swap and bad_swap_2.
bad_swap does the same as bad_swap_2 plus destroy_swap_extents() and
swap_cgroup_swapoff(); both are noops in the places where bad_swap_2 is
jumped to. With a single extra test for inode (matching the one in the
S_ISREG case below), all the error paths in the function can go to
bad_swap.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
9b01c350af sys_swapon: do only cleanup in the cleanup blocks
The only way error is 0 in the cleanup blocks is when the function is
returning successfully. In this case, the cleanup blocks were setting
S_SWAPFILE in the S_ISREG case. But this is not a cleanup.

Move the setting of S_SWAPFILE to just before the "goto out;" to make
this more clear. At this point, we do not need to test for inode because
it will never be NULL.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
f2090d2df5 sys_swapon: remove bdev variable
The bdev variable is always equivalent to (S_ISBLK(inode->i_mode) ?
p->bdev : NULL), as long as it being set is moved to a bit earlier. Use
this fact to remove the bdev variable.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
7de7fb6b34 sys_swapon: move setting of error nearer use
Move the setting of the error variable nearer the goto in a few places.

Avoids calling PTR_ERR() if not IS_ERR() in two places, and makes the
error condition more explicit in two other places.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
83ef99befc sys_swapon: remove did_down variable
Since mutex_lock(&inode->i_mutex) is called just after setting inode,
did_down is always equivalent to (inode && S_ISREG(inode->i_mode)).

Use this fact to remove the did_down variable.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
28b36bd741 sys_swapon: remove initial value of name variable
Now there is nothing which jumps to the cleanup blocks before the name
variable is set. There is no need to set it initially to NULL anymore.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:07 -07:00
Cesar Eduardo Barros
730c0581c8 sys_swapon: simplify error flow in alloc_swap_info()
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:06 -07:00
Cesar Eduardo Barros
2542e5134d sys_swapon: simplify error return from swap_info allocation
At this point in sys_swapon, there is nothing to free. Return directly
instead of jumping to the cleanup block at the end of the function.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:06 -07:00
Cesar Eduardo Barros
53cbb2435f sys_swapon: separate swap_info allocation
Move the swap_info allocation to its own function. Only code movement,
no functional changes.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:06 -07:00
Cesar Eduardo Barros
e8e6c2ec40 sys_swapon: do not depend on "type" after allocation
Within sys_swapon, after the swap_info entry has been allocated, we
always have type == p->type and swap_info[type] == p. Use this fact to
reduce the dependency on the "type" local variable within the function,
as a preparation to move the allocation of the swap_info entry to a
separate function.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:06 -07:00
Cesar Eduardo Barros
80b0df12b8 sys_swapon: remove changelog from function comment
Changelogs belong in the git history instead of in the source code.

Also, "The swapon system call" is redundant with
"SYSCALL_DEFINE2(swapon, ...)".

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Gaah. That's a _historical_ comment. But the patch-series depends on removal ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:05 -07:00
Cesar Eduardo Barros
803d0c8351 sys_swapon: use vzalloc() instead of vmalloc/memset
This patch series refactors the sys_swapon function.

sys_swapon is currently a very large function, with 313 lines (more than
12 25-line screens), which can make it a bit hard to read. This patch
series reduces this size by half, by extracting large chunks of related
code to new helper functions.

One of these chunks of code was nearly identical to the part of
sys_swapoff which is used in case of a failure return from
try_to_unuse(), so this patch series also makes both share the same
code.

As a side effect of all this refactoring, the compiled code gets a bit
smaller (from v1 of this patch series):

   text       data        bss        dec        hex    filename
  14012        944        276      15232       3b80    mm/swapfile.o.before
  13941        944        276      15161       3b39    mm/swapfile.o.after

This patch:

Use vzalloc() instead of vmalloc/memset.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:05 -07:00
Andi Kleen
cc5d462f77 mm: use __GFP_OTHER_NODE for transparent huge pages
Pass __GFP_OTHER_NODE for transparent hugepages NUMA allocations done by the
hugepages daemon.  This way the low level accounting for local versus
remote pages works correctly.

Contains improvements from Andrea Arcangeli

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:05 -07:00
Andi Kleen
78afd5612d mm: add __GFP_OTHER_NODE flag
Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
zone_statistics() that an allocation is on behalf of another thread.  This
way the local and remote counters can be still correct, even when
background daemons like khugepaged are changing memory mappings.

This only affects the accounting, but I think it's worth doing that right
to avoid confusing users.

I first tried to just pass down the right node, but this required a lot of
changes to pass down this parameter and at least one addition of a 10th
argument to a 9 argument function.  Using the flag is a lot less
intrusive.

Open: should be also used for migration?

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:05 -07:00
Andrea Arcangeli
11bc82d67d mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback
__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed as they have graceful fallback.  Waiting for I/O in those,
tends to be overkill in terms of latencies, so we can reduce their latency
by disabling sync migrate.

Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g.  get_request_wait
in the block layer) when ->writepage is called.  To prevent
__GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on
dirty page cache for asynchronous migration.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142

[mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: Clemens Ladisch <cladisch@googlemail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Tested-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:05 -07:00
Andrea Arcangeli
b2eef8c0d0 mm: compaction: minimise the time IRQs are disabled while isolating pages for migration
compaction_alloc() isolates pages for migration in isolate_migratepages.
While it's scanning, IRQs are disabled on the mistaken assumption the
scanning should be short.  Tests show this to be true for the most part
but contention times on the LRU lock can be increased.  Before this patch,
the IRQ disabled times for a simple test looked like

  Total sampled time IRQs off (not real total time): 5493
  Event shrink_inactive_list..shrink_zone                  1596 us count 1
  Event shrink_inactive_list..shrink_zone                  1530 us count 1
  Event shrink_inactive_list..shrink_zone                   956 us count 1
  Event shrink_inactive_list..shrink_zone                   541 us count 1
  Event shrink_inactive_list..shrink_zone                   531 us count 1
  Event split_huge_page..add_to_swap                        232 us count 1
  Event save_args..call_softirq                              36 us count 1
  Event save_args..call_softirq                              35 us count 2
  Event __wake_up..__wake_up                                  1 us count 1

This patch reduces the worst-case IRQs-disabled latencies by releasing the
lock every SWAP_CLUSTER_MAX pages that are scanned and releasing the CPU if
necessary. The cost of this is that the processing performing compaction will
be slower but IRQs being disabled for too long a time has worse consequences
as the following report shows;

  Total sampled time IRQs off (not real total time): 4367
  Event shrink_inactive_list..shrink_zone                   881 us count 1
  Event shrink_inactive_list..shrink_zone                   875 us count 1
  Event shrink_inactive_list..shrink_zone                   868 us count 1
  Event shrink_inactive_list..shrink_zone                   555 us count 1
  Event split_huge_page..add_to_swap                        495 us count 1
  Event compact_zone..compact_zone_order                    269 us count 1
  Event split_huge_page..add_to_swap                        266 us count 1
  Event shrink_inactive_list..shrink_zone                    85 us count 1
  Event save_args..call_softirq                              36 us count 2
  Event __wake_up..__wake_up                                  1 us count 1

[akpm@linux-foundation.org: simplify with s/unlocked/locked/]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: Clemens Ladisch <cladisch@googlemail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:05 -07:00
Mel Gorman
602605a42e mm: compaction: minimise the time IRQs are disabled while isolating free pages
compaction_alloc() isolates free pages to be used as migration targets.
While its scanning, IRQs are disabled on the mistaken assumption the
scanning should be short.  Analysis showed that IRQs were in fact being
disabled for substantial time.  A simple test was run using large
anonymous mappings with transparent hugepage support enabled to trigger
frequent compactions.  A monitor sampled what the worst IRQ-off latencies
were and a post-processing tool found the following;

  Total sampled time IRQs off (not real total time): 22355
  Event compaction_alloc..compaction_alloc                 8409 us count 1
  Event compaction_alloc..compaction_alloc                 7341 us count 1
  Event compaction_alloc..compaction_alloc                 2463 us count 1
  Event compaction_alloc..compaction_alloc                 2054 us count 1
  Event shrink_inactive_list..shrink_zone                  1864 us count 1
  Event shrink_inactive_list..shrink_zone                    88 us count 1
  Event save_args..call_softirq                              36 us count 1
  Event save_args..call_softirq                              35 us count 2
  Event __make_request..__blk_run_queue                      24 us count 1
  Event __alloc_pages_nodemask..__alloc_pages_nodemask        6 us count 1

i.e.  compaction is disabled IRQs for a prolonged period of time - 8ms in
one instance.  The full report generated by the tool can be found at

 http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report

This patch reduces the time IRQs are disabled by simply disabling IRQs at
the last possible minute.  An updated IRQs-off summary report then looks
like;

  Total sampled time IRQs off (not real total time): 5493
  Event shrink_inactive_list..shrink_zone                  1596 us count 1
  Event shrink_inactive_list..shrink_zone                  1530 us count 1
  Event shrink_inactive_list..shrink_zone                   956 us count 1
  Event shrink_inactive_list..shrink_zone                   541 us count 1
  Event shrink_inactive_list..shrink_zone                   531 us count 1
  Event split_huge_page..add_to_swap                        232 us count 1
  Event save_args..call_softirq                              36 us count 1
  Event save_args..call_softirq                              35 us count 2
  Event __wake_up..__wake_up                                  1 us count 1

A full report is again available at

  http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report

As should be obvious, IRQ disabled latencies due to compaction are
almost elimimnated for this particular test.

[aarcange@redhat.com: Fix initialisation of isolated]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: Clemens Ladisch <cladisch@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:05 -07:00
Hugh Dickins
5b280c0cc7 mm: don't return 0 too early from find_get_pages()
Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
truncate_inode_pages_range() - stop looking further when it returns 0.

But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
especially if we have preemptible RCU enabled, isn't it conceivable that
all 14 pages returned could be removed from the page cache by
shrink_page_list(), before find_get_pages() gets to process them?  So
causing it to return 0 although there may be plenty more pages beyond.

Make find_get_pages() and find_get_pages_tag() check for this unlikely
case, and restart should it occur; but callers of find_get_pages_contig()
have no such expectation, it's okay for that to return 0 early.

I have not seen this in practice, just worried by the possibility.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Salman Qazi <sqazi@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Hugh Dickins
9d8aa4ea85 mm: remove worrying dead code from find_get_pages()
The radix_tree_deref_retry() case in find_get_pages() has a strange little
excrescence, not seen in the other gang lookups: it looks like the start
of an abandoned attempt to guarantee forward progress in a case that
cannot arise.

ret should always be 0 here: if it isn't, then going back to restart will
leak references to pages already gotten.  There used to be a comment
saying nr_found is necessarily 1 here: that's not quite true, but the
radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
race with it being moved out of the radix_tree root or back.

Remove the worrisome two lines, add a brief comment here and in
find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
find_get_pages() should it ever be seen elsewhere than at 0.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Salman Qazi <sqazi@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Petr Holasek
c033a93c0d hugetlbfs: correct handling of negative input to /proc/sys/vm/nr_hugepages
When the user inserts a negative value into /proc/sys/vm/nr_hugepages it
will cause the kernel to allocate as many hugepages as possible and to
then update /proc/meminfo to reflect this.

This changes the behavior so that the negative input will result in
nr_hugepages value being unchanged.

Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Anton Arapov <anton@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Mel Gorman
8afdcece49 mm: vmscan: kswapd should not free an excessive number of pages when balancing small zones
When reclaiming for order-0 pages, kswapd requires that all zones be
balanced.  Each cycle through balance_pgdat() does background ageing on
all zones if necessary and applies equal pressure on the inactive zone
unless a lot of pages are free already.

A "lot of free pages" is defined as a "balance gap" above the high
watermark which is currently 7*high_watermark.  Historically this was
reasonable as min_free_kbytes was small.  However, on systems using huge
pages, it is recommended that min_free_kbytes is higher and it is tuned
with hugeadm --set-recommended-min_free_kbytes.  With the introduction of
transparent huge page support, this recommended value is also applied.  On
X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
expect around 68M of memory to be free.  The Normal zone is approximately
35000 pages so under even normal memory pressure such as copying a large
file, it gets exhausted quickly.  As it is getting exhausted, kswapd
applies pressure equally to all zones, including the DMA32 zone.  DMA32 is
approximately 700,000 pages with a high watermark of around 23,000 pages.
In this situation, kswapd will reclaim around (23000*8 where 8 is the high
watermark + balance gap of 7 * high watermark) pages or 718M of pages
before the zone is ignored.  What the user sees is that free memory far
higher than it should be.

To avoid an excessive number of pages being reclaimed from the larger
zones, explicitely defines the "balance gap" to be either 1% of the zone
or the low watermark for the zone, whichever is smaller.  While kswapd
will check all zones to apply pressure, it'll ignore zones that meets the
(high_wmark + balance_gap) watermark.

To test this, 80G were copied from a partition and the amount of memory
being used was recorded.  A comparison of a patch and unpatched kernel can
be seen at
http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
and shows that kswapd is not reclaiming as much memory with the patch
applied.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Namhyung Kim
7571966189 mempolicy: remove redundant check in __mpol_equal()
The 'flags' field is already checked, no need to do it again.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Dave Hansen
033193275b pagewalk: only split huge pages when necessary
Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
unconditionally split any transparent huge pages it runs in to.  In
practice, that means that anyone doing a

	cat /proc/$pid/smaps

will unconditionally break down every huge page in the process and depend
on khugepaged to re-collapse it later.  This is fairly suboptimal.

This patch changes that behavior.  It teaches each ->pmd_entry handler
(there are five) that they must break down the THPs themselves.  Also, the
_generic_ code will never break down a THP unless a ->pte_entry handler is
actually set.

This means that the ->pmd_entry handlers can now choose to deal with THPs
without breaking them down.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Cc: Michael J Wolf <mjwolf@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Minchan Kim
278df9f451 mm: reclaim invalidated page ASAP
invalidate_mapping_pages is very big hint to reclaimer.  It means user
doesn't want to use the page any more.  So in order to prevent working set
page eviction, this patch move the page into tail of inactive list by
PG_reclaim.

Please, remember that pages in inactive list are working set as well as
active list.  If we don't move pages into inactive list's tail, pages near
by tail of inactive list can be evicted although we have a big clue about
useless pages.  It's totally bad.

Now PG_readahead/PG_reclaim is shared.  fe3cba17 added ClearPageReclaim
into clear_page_dirty_for_io for preventing fast reclaiming readahead
marker page.

In this series, PG_reclaim is used by invalidated page, too.  If VM find
the page is invalidated and it's dirty, it sets PG_reclaim to reclaim
asap.  Then, when the dirty page will be writeback,
clear_page_dirty_for_io will clear PG_reclaim unconditionally.  It
disturbs this serie's goal.

I think it's okay to clear PG_readahead when the page is dirty, not
writeback time.  So this patch moves ClearPageReadahead.  In v4,
ClearPageReadahead in set_page_dirty has a problem which is reported by
Steven Barrett.  It's due to compound page.  Some driver(ex, audio) calls
set_page_dirty with compound page which isn't on LRU.  but my patch does
ClearPageRelcaim on compound page.  In non-CONFIG_PAGEFLAGS_EXTENDED, it
breaks PageTail flag.

I think it doesn't affect THP and pass my test with THP enabling but Cced
Andrea for double check.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Steven Barrett <damentz@liquorix.net>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Minchan Kim
3f58a82943 memcg: move memcg reclaimable page into tail of inactive list
The rotate_reclaimable_page function moves just written out pages, which
the VM wanted to reclaim, to the end of the inactive list.  That way the
VM will find those pages first next time it needs to free memory.

This patch applies the rule in memcg.  It can help to prevent unnecessary
working page eviction of memcg.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Minchan Kim
315601809d mm: deactivate invalidated pages
Recently, there are reported problem about thrashing.
(http://marc.info/?l=rsync&m=128885034930933&w=2) It happens by backup
workloads(ex, nightly rsync).  That's because the workload makes just
use-once pages and touches pages twice.  It promotes the page into active
list so that it results in working set page eviction.

Some app developer want to support POSIX_FADV_NOREUSE.  But other OSes
don't support it, either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

By other approach, app developers use POSIX_FADV_DONTNEED.  But it has a
problem.  If kernel meets page is writing during invalidate_mapping_pages,
it can't work.  It makes for application programmer to use it since they
always have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
make sure the pages could be discardable.  At last, they can't use
deferred write of kernel so that they could see performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)

In fact, invalidation is very big hint to reclaimer.  It means we don't
use the page any more.  So let's move the writing page into inactive
list's head if we can't truncate it right now.

Why I move page to head of lru on this patch, Dirty/Writeback page would
be flushed sooner or later.  It can prevent writeout of pageout which is
less effective than flusher's writeout.

Originally, I reused lru_demote of Peter with some change so added his
Signed-off-by.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Ben Gamari <bgamari.foss@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Peter Zijlstra
01d8b20dec mm: simplify anon_vma refcounts
This patch changes the anon_vma refcount to be 0 when the object is free.
It does this by adding 1 ref to being in use in the anon_vma structure
(iow.  the anon_vma->head list is not empty).

This allows a simpler release scheme without having to check both the
refcount and the list as well as avoids taking a ref for each entry on the
list.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Peter Zijlstra
83813267c6 mm: move anon_vma ref out from under CONFIG_foo
We need the anon_vma refcount unconditionally to simplify the anon_vma
lifetime rules.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Peter Zijlstra
9e60109f12 mm: rename drop_anon_vma() to put_anon_vma()
The normal code pattern used in the kernel is: get/put.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Akinobu Mita
7bc32f6f90 mm: debug-pagealloc: fix kconfig dependency warning
Fix kconfig dependency warning to satisfy dependencies:

warning: (PAGE_POISONING) selects DEBUG_PAGEALLOC which has unmet
direct dependencies (DEBUG_KERNEL && ARCH_SUPPORTS_DEBUG_PAGEALLOC &&
(!HIBERNATION || !PPC && !SPARC) && !KMEMCHECK)

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Namhyung Kim
1d16871d8c mm: batch-free pcp list if possible
free_pcppages_bulk() frees pages from pcp lists in a round-robin fashion
by keeping batch_free counter.  But it doesn't need to spin if there is
only one non-empty list.  This can be checked by batch_free ==
MIGRATE_PCPTYPES.

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim
e64a782fec mm: change __remove_from_page_cache()
Now we renamed remove_from_page_cache with delete_from_page_cache.  As
consistency of __remove_from_swap_cache and remove_from_swap_cache, we
change internal page cache handling function name, too.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim
702cfbf93a mm: goodbye remove_from_page_cache()
Now delete_from_page_cache() replaces remove_from_page_cache().  So we
remove remove_from_page_cache so fs or something out of mainline will
notice it when compile time and can fix it.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim
5adc7b518b mm: truncate: change remove_from_page_cache
This patch series changes remove_from_page_cache()'s page ref counting
rule.  Page cache ref count is decreased in delete_from_page_cache().  So
we don't need to decrease the page reference in callers.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim
4c73b1bc6b mm: shmem: change remove_from_page_cache
This patch series changes remove_from_page_cache()'s page ref counting
rule.  Page cache ref count is decreased in delete_from_page_cache().  So
we don't need to decrease the page reference in callers.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim
97cecb5a25 mm: introduce delete_from_page_cache()
Presently we increase the page refcount in add_to_page_cache() but don't
decrease it in remove_from_page_cache().  Such asymmetry adds confusion,
requiring that callers notice it and a comment explaining why they release
a page reference.  It's not a good API.

A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
gave up because reiser4's drop_page() had to unlock the page between
removing it from page cache and doing the page_cache_release().  But now
the situation is changed.  I think at least things in current mainline
don't have any obstacles.  The problem is for out-of-mainline filesystems
- if they have done such things as reiser4, this patch could be a problem
but they will discover this at compile time since we remove
remove_from_page_cache().

This patch:

This function works as just wrapper remove_from_page_cache().  The
difference is that it decreases page references in itself.  So caller have
to make sure it has a page reference before calling.

This patch is ready for removing remove_from_page_cache().

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Miklos Szeredi
ef6a3c6311 mm: add replace_page_cache_page() function
This function basically does:

     remove_from_page_cache(old);
     page_cache_release(old);
     add_to_page_cache_locked(new);

Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.

If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.

This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.

[minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Gleb Natapov
318b275fbc mm: allow GUP to fail instead of waiting on a page
GUP user may want to try to acquire a reference to a page if it is already
in memory, but not if IO, to bring it in, is needed.  For example KVM may
tell vcpu to schedule another guest process if current one is trying to
access swapped out page.  Meanwhile, the page will be swapped in and the
guest process, that depends on it, will be able to run again.

This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
FOLL_NOWAIT follow_page flags.  FAULT_FLAG_RETRY_NOWAIT, when used in
conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
instead.

[akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Prarit Bhargava
5fda1bd5b8 mm: notifier_from_errno() cleanup
While looking at some other notifier callbacks I noticed this code could
use a simple cleanup.

notifier_from_errno() no longer needs the if (ret)/else conditional.  That
same conditional is now done in notifier_from_errno().

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:01 -07:00
David Rientjes
cbf978bfb1 oom: suppress nodes that are not allowed from meminfo on page alloc failure
Displaying extremely verbose meminfo for all nodes on the system is
overkill for page allocation failures when the context restricts that
allocation to only a subset of nodes.  We don't particularly care about
the state of all nodes when some are not allowed in the current context,
they can have an abundance of memory but we can't allocate from that part
of memory.

This patch suppresses disallowed nodes from the meminfo dump on a page
allocation failure if the context requires it.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:01 -07:00
David Rientjes
29423e77c0 oom: suppress show_mem() for many nodes in irq context on page alloc failure
When a page allocation failure occurs, show_mem() is called to dump the
state of the VM so users may understand what happened to get into that
condition.

This output, however, can be extremely verbose.  In irq context, it may
result in significant delays that incur NMI watchdog timeouts when the
machine is large (we use CONFIG_NODES_SHIFT > 8 here to define a "large"
machine since the length of the show_mem() output is proportional to the
number of possible nodes).

This patch suppresses the show_mem() call in irq context when the kernel
has CONFIG_NODES_SHIFT > 8.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:01 -07:00
David Rientjes
ddd588b5dd oom: suppress nodes that are not allowed from meminfo on oom kill
The oom killer is extremely verbose for machines with a large number of
cpus and/or nodes.  This verbosity can often be harmful if it causes other
important messages to be scrolled from the kernel log and incurs a
signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
8.

This patch causes only memory information to be displayed for nodes that
are allowed by current's cpuset when dumping the VM state.  Information
for all other nodes is irrelevant to the oom condition; we don't care if
there's an abundance of memory elsewhere if we can't access it.

This only affects the behavior of dumping memory information when an oom
is triggered.  Other dumps, such as for sysrq+m, still display the
unfiltered form when using the existing show_mem() interface.

Additionally, the per-cpu pageset statistics are extremely verbose in oom
killer output, so it is now suppressed.  This removes

	nodes_weight(current->mems_allowed) * (1 + nr_cpus)

lines from the oom killer output.

Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
nodes.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:01 -07:00
Minchan Kim
9d502c1c8d mm/compaction: check migrate_pages's return value instead of list_empty()
Many migrate_page's caller check return value instead of list_empy by
cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED counting").  This patch
makes compaction's migrate_pages consistent with others.  This patch
should not change old behavior.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:00 -07:00
Andrea Arcangeli
d527caf22e mm: compaction: prevent kswapd compacting memory to reduce CPU usage
This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
order > 0") due to reports stating that kswapd CPU usage was higher and
IRQs were being disabled more frequently.  This was reported at
http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

Without this patch applied, CPU usage by kswapd hovers around the 20% mark
according to the tester (Arthur Marsh:
http://www.spinics.net/linux/fedora/alsa-user/msg09899.html).  With this
patch applied, it's around 2%.

The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
triggered by high-order allocations hitting the low watermark for their
order and waking kswapd on kernels with CONFIG_COMPACTION set.  The most
common trigger for this is network cards configured for jumbo frames but
it's also possible it'll be triggered by fork-heavy workloads (order-1)
and some wireless cards which depend on order-1 allocations.

The symptoms for the user will be high CPU usage by kswapd in low-memory
situations which could be confused with another writeback problem.  While
a patch like 5a03b051 may be reintroduced in the future, this patch plays
it safe for now and reverts it.

[mel@csn.ul.ie: Beefed up the changelog]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: <stable@kernel.org>		[2.6.38.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:00 -07:00
Nick Piggin
89699605fe mm: vmap area cache
Provide a free area cache for the vmalloc virtual address allocator, based
on the algorithm used by the user virtual memory allocator.

This reduces the number of rbtree operations and linear traversals over
the vmap extents in order to find a free area, by starting off at the last
point that a free area was found.

The free area cache is reset if areas are freed behind it, or if we are
searching for a smaller area or alignment than last time.  So allocation
patterns are not changed (verified by corner-case and random test cases in
userspace testing).

This solves a regression caused by lazy vunmap TLB purging introduced in
db64fe02 (mm: rewrite vmap layer).  That patch will leave extents in the
vmap allocator after they are vunmapped, and until a significant number
accumulate that can be flushed in a single batch.  So in a workload that
vmalloc/vfree frequently, a chain of extents will build up from
VMALLOC_START address, which have to be iterated over each time (giving an
O(n) type of behaviour).

After this patch, the search will start from where it left off, giving
closer to an amortized O(1).

This is verified to solve regressions reported Steven in GFS2, and Avi in
KVM.

Hugh's update:

: I tried out the recent mmotm, and on one machine was fortunate to hit
: the BUG_ON(first->va_start < addr) which seems to have been stalling
: your vmap area cache patch ever since May.

: I can get you addresses etc, I did dump a few out; but once I stared
: at them, it was easier just to look at the code: and I cannot see how
: you would be so sure that first->va_start < addr, once you've done
: that addr = ALIGN(max(...), align) above, if align is over 0x1000
: (align was 0x8000 or 0x4000 in the cases I hit: ioremaps like Steve).

: I originally got around it by just changing the
: 		if (first->va_start < addr) {
: to
: 		while (first->va_start < addr) {
: without thinking about it any further; but that seemed unsatisfactory,
: why would we want to loop here when we've got another very similar
: loop just below it?

: I am never going to admit how long I've spent trying to grasp your
: "while (n)" rbtree loop just above this, the one with the peculiar
: 		if (!first && tmp->va_start < addr + size)
: in.  That's unfamiliar to me, I'm guessing it's designed to save a
: subsequent rb_next() in a few circumstances (at risk of then setting
: a wrong cached_hole_size?); but they did appear few to me, and I didn't
: feel I could sign off something with that in when I don't grasp it,
: and it seems responsible for extra code and mistaken BUG_ON below it.

: I've reverted to the familiar rbtree loop that find_vma() does (but
: with va_end >= addr as you had, to respect the additional guard page):
: and then (given that cached_hole_size starts out 0) I don't see the
: need for any complications below it.  If you do want to keep that loop
: as you had it, please add a comment to explain what it's trying to do,
: and where addr is relative to first when you emerge from it.

: Aren't your tests "size <= cached_hole_size" and
: "addr + size > first->va_start" forgetting the guard page we want
: before the next area?  I've changed those.

: I have not changed your many "addr + size - 1 < addr" overflow tests,
: but have since come to wonder, shouldn't they be "addr + size < addr"
: tests - won't the vend checks go wrong if addr + size is 0?

: I have added a few comments - Wolfgang Wander's 2.6.13 description of
: 1363c3cd86 Avoiding mmap fragmentation
: helped me a lot, perhaps a pointer to that would be good too.  And I found
: it easier to understand when I renamed cached_start slightly and moved the
: overflow label down.

: This patch would go after your mm-vmap-area-cache.patch in mmotm.
: Trivially, nobody is going to get that BUG_ON with this patch, and it
: appears to work fine on my machines; but I have not given it anything like
: the testing you did on your original, and may have broken all the
: performance you were aiming for.  Please take a look and test it out
: integrate with yours if you're satisfied - thanks.

[akpm@linux-foundation.org: add locking comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reported-and-tested-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-and-tested-by: Avi Kivity <avi@redhat.com>
Tested-by: "Barry J. Marson" <bmarson@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:00 -07:00
David Rientjes
edd45544c6 oom: avoid deferring oom killer if exiting task is being traced
The oom killer naturally defers killing anything if it finds an eligible
task that is already exiting and has yet to detach its ->mm.  This avoids
unnecessarily killing tasks when one is already in the exit path and may
free enough memory that the oom killer is no longer needed.  This is
detected by PF_EXITING since threads that have already detached its ->mm
are no longer considered at all.

The problem with always deferring when a thread is PF_EXITING, however, is
that it may never actually exit when being traced, specifically if another
task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does not want
to defer in this case since there is no guarantee that thread will ever
exit without intervention.

This patch will now only defer the oom killer when a thread is PF_EXITING
and no ptracer has stopped its progress in the exit path.  It also ensures
that a child is sacrificed for the chosen parent only if it has a
different ->mm as the comment implies: this ensures that the thread group
leader is always targeted appropriately.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:43:58 -07:00
Andrey Vagin
30e2b41f20 oom: skip zombies when iterating tasklist
We shouldn't defer oom killing if a thread has already detached its ->mm
and still has TIF_MEMDIE set.  Memory needs to be freed, so find kill
other threads that pin the same ->mm or find another task to kill.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:43:58 -07:00
David Rientjes
3a5dda7a17 oom: prevent unnecessary oom kills or kernel panics
This patch prevents unnecessary oom kills or kernel panics by reverting
two commits:

	495789a5 (oom: make oom_score to per-process value)
	cef1d352 (oom: multi threaded process coredump don't make deadlock)

First, 495789a5 (oom: make oom_score to per-process value) ignores the
fact that all threads in a thread group do not necessarily exit at the
same time.

It is imperative that select_bad_process() detect threads that are in the
exit path, specifically those with PF_EXITING set, to prevent needlessly
killing additional tasks.  If a process is oom killed and the thread group
leader exits, select_bad_process() cannot detect the other threads that
are PF_EXITING by iterating over only processes.  Thus, it currently
chooses another task unnecessarily for oom kill or panics the machine when
nothing else is eligible.

By iterating over threads instead, it is possible to detect threads that
are exiting and nominate them for oom kill so they get access to memory
reserves.

Second, cef1d352 (oom: multi threaded process coredump don't make
deadlock) erroneously avoids making the oom killer a no-op when an
eligible thread other than current isfound to be exiting.  We want to
detect this situation so that we may allow that exiting thread time to
exit and free its memory; if it is able to exit on its own, that should
free memory so current is no loner oom.  If it is not able to exit on its
own, the oom killer will nominate it for oom kill which, in this case,
only means it will get access to memory reserves.

Without this change, it is easy for the oom killer to unnecessarily target
tasks when all threads of a victim don't exit before the thread group
leader or, in the worst case, panic the machine.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:43:58 -07:00
Mel Gorman
52c50567d8 mm: swap: unlock swapfile inode mutex before closing file on bad swapfiles
If an administrator tries to swapon a file backed by NFS, the inode mutex is
taken (as it is for any swapfile) but later identified to be a bad swapfile
due to the lack of bmap and tries to cleanup. During cleanup, an attempt is
made to close the file but with inode->i_mutex still held. Closing an NFS
file syncs it which tries to acquire the inode mutex leading to deadlock. If
lockdep is enabled the following appears on the console;

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.38-rc8-autobuild #1
    ---------------------------------------------
    swapon/2192 is trying to acquire lock:
     (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c

    but task is already holding lock:
     (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    other info that might help us debug this:
    1 lock held by swapon/2192:
     #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    stack backtrace:
    Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
    Call Trace:
        __lock_acquire+0x2eb/0x1623
        find_get_pages_tag+0x14a/0x174
        pagevec_lookup_tag+0x25/0x2e
        vfs_fsync_range+0x47/0x7c
        lock_acquire+0xd3/0x100
        vfs_fsync_range+0x47/0x7c
        nfs_flush_one+0x0/0xdf [nfs]
        mutex_lock_nested+0x40/0x2b1
        vfs_fsync_range+0x47/0x7c
        vfs_fsync_range+0x47/0x7c
        vfs_fsync+0x1c/0x1e
        nfs_file_flush+0x64/0x69 [nfs]
        filp_close+0x43/0x72
        sys_swapon+0xa39/0xae7
        sysret_check+0x2e/0x69
        system_call_fastpath+0x16/0x1b

This patch releases the mutex if its held before calling filep_close()
so swapon fails as expected without deadlock when the swapfile is backed
by NFS.  If accepted for 2.6.39, it should also be considered a -stable
candidate for 2.6.38 and 2.6.37.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@kernel.org>		[2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:43:58 -07:00
Christoph Lameter
4fdccdfbb4 slub: Add statistics for this_cmpxchg_double failures
Add some statistics for debugging.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-22 20:48:04 +02:00
Christoph Lameter
2fd66c517d slub: Add missing irq restore for the OOM path
OOM path is missing the irq restore in the CONFIG_CMPXCHG_LOCAL case.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-22 20:48:04 +02:00
Linus Torvalds
14577beb82 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case
  slab,rcu: don't assume the size of struct rcu_head
  slub,rcu: don't assume the size of struct rcu_head
  slub: automatically reserve bytes at the end of slab
  Lockless (and preemptless) fastpaths for slub
  slub: Get rid of slab_free_hook_irq()
  slub: min_partial needs to be in first cacheline
  slub: fix ksize() build error
  slub: fix kmemcheck calls to match ksize() hints
  Revert "slab: Fix missing DEBUG_SLAB last user"
  mm: Remove support for kmem_cache_name()
2011-03-22 09:36:23 -07:00
Pekka Enberg
e8c500c2b6 Merge branch 'slub/lockless' into for-linus
Conflicts:
	include/linux/slub_def.h
2011-03-20 18:13:26 +02:00
Pekka Enberg
c53badd080 Merge branch 'slab/next' into for-linus 2011-03-20 18:12:03 +02:00
Christoph Lameter
a24c5a0ea9 slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case
The redo label needs #ifdeffery. Fixes the following problem introduced by
commit 8a5ec0ba42 ("Lockless (and preemptless) fastpaths for slub"):

  mm/slub.c: In function 'slab_free':
  mm/slub.c:2124: warning: label 'redo' defined but not used

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-20 18:11:07 +02:00
Linus Torvalds
e16b396ce3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
  doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
  Update cpuset info & webiste for cgroups
  dcdbas: force SMI to happen when expected
  arch/arm/Kconfig: remove one to many l's in the word.
  asm-generic/user.h: Fix spelling in comment
  drm: fix printk typo 'sracth'
  Remove one to many n's in a word
  Documentation/filesystems/romfs.txt: fixing link to genromfs
  drivers:scsi Change printk typo initate -> initiate
  serial, pch uart: Remove duplicate inclusion of linux/pci.h header
  fs/eventpoll.c: fix spelling
  mm: Fix out-of-date comments which refers non-existent functions
  drm: Fix printk typo 'failled'
  coh901318.c: Change initate to initiate.
  mbox-db5500.c Change initate to initiate.
  edac: correct i82975x error-info reported
  edac: correct i82975x mci initialisation
  edac: correct commented info
  fs: update comments to point correct document
  target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
  ...

Trivial conflict in fs/eventpoll.c (spelling vs addition)
2011-03-18 10:37:40 -07:00
Linus Torvalds
ec0afc9311 Merge branch 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (55 commits)
  KVM: unbreak userspace that does not sets tss address
  KVM: MMU: cleanup pte write path
  KVM: MMU: introduce a common function to get no-dirty-logged slot
  KVM: fix rcu usage in init_rmode_* functions
  KVM: fix kvmclock regression due to missing clock update
  KVM: emulator: Fix permission checking in io permission bitmap
  KVM: emulator: Fix io permission checking for 64bit guest
  KVM: SVM: Load %gs earlier if CONFIG_X86_32_LAZY_GS=n
  KVM: x86: Remove useless regs_page pointer from kvm_lapic
  KVM: improve comment on rcu use in irqfd_deassign
  KVM: MMU: remove unused macros
  KVM: MMU: cleanup page alloc and free
  KVM: MMU: do not record gfn in kvm_mmu_pte_write
  KVM: MMU: move mmu pages calculated out of mmu lock
  KVM: MMU: set spte accessed bit properly
  KVM: MMU: fix kvm_mmu_slot_remove_write_access dropping intermediate W bits
  KVM: Start lock documentation
  KVM: better readability of efer_reserved_bits
  KVM: Clear async page fault hash after switching to real mode
  KVM: VMX: Initialize vm86 TSS only once.
  ...
2011-03-17 18:40:35 -07:00
Andrea Arcangeli
ef2b4b95a6 mm: PageBuddy and mapcount robustness
Change the _mapcount value indicating PageBuddy from -2 to -128 for
more robusteness against page_mapcount() undeflows.

Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
ignore the previous retval of PageBuddy().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-17 16:31:13 -07:00
Huang Ying
f58c9df78c mm: remove is_hwpoison_address
Unused.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:27 -03:00
Huang Ying
69ebb83e13 mm: make __get_user_pages return -EHWPOISON for HWPOISON page optionally
Make __get_user_pages return -EHWPOISON for HWPOISON page only if
FOLL_HWPOISON is specified.  With this patch, the interested callers
can distinguish HWPOISON pages from general FAULT pages, while other
callers will still get -EFAULT for all these pages, so the user space
interface need not to be changed.

This feature is needed by KVM, where UCR MCE should be relayed to
guest for HWPOISON page, while instruction emulation and MMIO will be
tried for general FAULT page.

The idea comes from Andrew Morton.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:27 -03:00
Huang Ying
0014bd990e mm: export __get_user_pages
In most cases, get_user_pages and get_user_pages_fast should be used
to pin user pages in memory.  But sometimes, some special flags except
FOLL_GET, FOLL_WRITE and FOLL_FORCE are needed, for example in
following patch, KVM needs FOLL_HWPOISON.  To support these users,
__get_user_pages is exported directly.

There are some symbol name conflicts in infiniband driver, fixed them too.

Signed-off-by: Huang Ying <ying.huang@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Michel Lespinasse <walken@google.com>
CC: Roland Dreier <roland@kernel.org>
CC: Ralph Campbell <infinipath@qlogic.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:27 -03:00
Jens Axboe
95f28604a6 fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
We don't have proper reference counting for this yet, so we run into
cases where the device is pulled and we OOPS on flushing the fs data.
This happens even though the dirty inodes have already been
migrated to the default_backing_dev_info.

Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-17 11:13:12 +01:00
Shaohua Li
9b6096a65f mm: make generic_writepages() use plugging
This recovers a performance regression caused by the removal
of the per-device plugging.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-17 10:47:06 +01:00
Linus Torvalds
0f6e0e8448 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
  AppArmor: kill unused macros in lsm.c
  AppArmor: cleanup generated files correctly
  KEYS: Add an iovec version of KEYCTL_INSTANTIATE
  KEYS: Add a new keyctl op to reject a key with a specified error code
  KEYS: Add a key type op to permit the key description to be vetted
  KEYS: Add an RCU payload dereference macro
  AppArmor: Cleanup make file to remove cruft and make it easier to read
  SELinux: implement the new sb_remount LSM hook
  LSM: Pass -o remount options to the LSM
  SELinux: Compute SID for the newly created socket
  SELinux: Socket retains creator role and MLS attribute
  SELinux: Auto-generate security_is_socket_class
  TOMOYO: Fix memory leak upon file open.
  Revert "selinux: simplify ioctl checking"
  selinux: drop unused packet flow permissions
  selinux: Fix packet forwarding checks on postrouting
  selinux: Fix wrong checks for selinux_policycap_netpeer
  selinux: Fix check for xfrm selinux context algorithm
  ima: remove unnecessary call to ima_must_measure
  IMA: remove IMA imbalance checking
  ...
2011-03-16 09:15:43 -07:00
Linus Torvalds
181f977d13 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (93 commits)
  x86, tlb, UV: Do small micro-optimization for native_flush_tlb_others()
  x86-64, NUMA: Don't call numa_set_distanc() for all possible node combinations during emulation
  x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation()
  x86-64, NUMA: Clean up initmem_init()
  x86-64, NUMA: Fix numa_emulation code with node0 without RAM
  x86-64, NUMA: Revert NUMA affine page table allocation
  x86: Work around old gas bug
  x86-64, NUMA: Better explain numa_distance handling
  x86-64, NUMA: Fix distance table handling
  mm: Move early_node_map[] reverse scan helpers under HAVE_MEMBLOCK
  x86-64, NUMA: Fix size of numa_distance array
  x86: Rename e820_table_* to pgt_buf_*
  bootmem: Move __alloc_memory_core_early() to nobootmem.c
  bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
  bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
  x86-64, NUMA: Seperate out numa_alloc_distance() from numa_set_distance()
  x86-64, NUMA: Add proper function comments to global functions
  x86-64, NUMA: Move NUMA emulation into numa_emulation.c
  x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file
  x86-64, NUMA: Do not scan two times for setup_node_bootmem()
  ...

Fix up conflicts in arch/x86/kernel/smpboot.c
2011-03-15 19:49:10 -07:00
Linus Torvalds
422e6c4bc4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits)
  tidy the trailing symlinks traversal up
  Turn resolution of trailing symlinks iterative everywhere
  simplify link_path_walk() tail
  Make trailing symlink resolution in path_lookupat() iterative
  update nd->inode in __do_follow_link() instead of after do_follow_link()
  pull handling of one pathname component into a helper
  fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH
  Allow passing O_PATH descriptors via SCM_RIGHTS datagrams
  readlinkat(), fchownat() and fstatat() with empty relative pathnames
  Allow O_PATH for symlinks
  New kind of open files - "location only".
  ext4: Copy fs UUID to superblock
  ext3: Copy fs UUID to superblock.
  vfs: Export file system uuid via /proc/<pid>/mountinfo
  unistd.h: Add new syscalls numbers to asm-generic
  x86: Add new syscalls for x86_64
  x86: Add new syscalls for x86_32
  fs: Remove i_nlink check from file system link callback
  fs: Don't allow to create hardlink for deleted file
  vfs: Add open by file handle support
  ...
2011-03-15 15:48:13 -07:00
James Morris
a002951c97 Merge branch 'next' into for-linus 2011-03-16 09:41:17 +11:00
Ingo Molnar
8460b3e5bc Merge commit 'v2.6.38' into x86/mm
Conflicts:
	arch/x86/mm/numa_64.c

Merge reason: Resolve the conflict, update the branch to .38.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-15 08:29:44 +01:00
Linus Torvalds
52d3c03675 Revert "oom: oom_kill_process: fix the child_points logic"
This reverts the parent commit.  I hate doing that, but it's generating
some discussion ("half of it is right"), and since I am planning on
doing the 2.6.38 release later today we can punt it to stable if
required. Let's not rock the boat right now.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-14 15:17:07 -07:00
Oleg Nesterov
dc1b83ab08 oom: oom_kill_process: fix the child_points logic
oom_kill_process() starts with victim_points == 0.  This means that
(most likely) any child has more points and can be killed erroneously.

Also, "children has a different mm" doesn't match the reality, we should
check child->mm != t->mm.  This check is not exactly correct if t->mm ==
NULL but this doesn't really matter, oom_kill_task() will kill them
anyway.

Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-14 13:38:35 -07:00
Hugh Dickins
2fbfac4e05 thp+memcg-numa: fix BUG at include/linux/mm.h:370!
THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-14 08:29:50 -07:00
Aneesh Kumar K.V
5fe0c23788 exportfs: Return the minimum required handle size
The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14 09:15:28 -04:00
Andrea Arcangeli
2da28bfd96 thp: fix page_referenced to modify mapcount/vm_flags only if page is found
When vmscan.c calls page_referenced(), if an anon page was created
before a process forked, rmap will search for it in both of the
processes, even though one of them might have since broken COW.

If the child process mlocks the vma where the COWed page belongs to,
page_referenced() running on the page mapped by the parent would lead to
*vm_flags getting VM_LOCKED set erroneously (leading to the references
on the parent page being ignored and evicting the parent page too
early).

*mapcount would also be decremented by page_referenced_one even if the
page wasn't found by page_check_address.

This also lets pmdp_clear_flush_young_notify() go ahead on a
pmd_trans_splitting() pmd.

We hold the page_table_lock so __split_huge_page_map() must wait the
pmdp_clear_flush_young_notify() to complete before it can modify the
pmd.  The pmd is also still mapped in userland so the young bit may
materialize through a tlb miss before split_huge_page_map runs.

This will provide a more accurate page_referenced() behavior during
split_huge_page().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel<riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-13 15:35:57 -07:00
Pekka Enberg
865d794d1f Merge branch 'slab/urgent' into slab/next 2011-03-11 18:11:19 +02:00
Pekka Enberg
c914955675 Merge branch 'slab/rcu' into slab/next
Conflicts:
	mm/slub.c
2011-03-11 18:10:45 +02:00
Lai Jiangshan
5bfe53a77e slab,rcu: don't assume the size of struct rcu_head
The size of struct rcu_head may be changed. When it becomes larger,
it may pollute the data after struct slab.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-11 18:06:35 +02:00
Lai Jiangshan
da9a638c6f slub,rcu: don't assume the size of struct rcu_head
The size of struct rcu_head may be changed. When it becomes larger,
it will pollute the page array.

We reserve some some bytes for struct rcu_head when a slab
is allocated in this situation.

Changed from V1:
	use VM_BUG_ON instead BUG_ON

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-11 18:06:34 +02:00
Lai Jiangshan
ab9a0f196f slub: automatically reserve bytes at the end of slab
There is no "struct" for slub's slab, it shares with struct page.
But struct page is very small, it is insufficient when we need
to add some metadata for slab.

So we add a field "reserved" to struct kmem_cache, when a slab
is allocated, kmem_cache->reserved bytes are automatically reserved
at the end of the slab for slab's metadata.

Changed from v1:
	Export the reserved field via sysfs

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-11 18:06:34 +02:00
Christoph Lameter
8a5ec0ba42 Lockless (and preemptless) fastpaths for slub
Use the this_cpu_cmpxchg_double functionality to implement a lockless
allocation algorithm on arches that support fast this_cpu_ops.

Each of the per cpu pointers is paired with a transaction id that ensures
that updates of the per cpu information can only occur in sequence on
a certain cpu.

A transaction id is a "long" integer that is comprised of an event number
and the cpu number. The event number is incremented for every change to the
per cpu state. This means that the cmpxchg instruction can verify for an
update that nothing interfered and that we are updating the percpu structure
for the processor where we picked up the information and that we are also
currently on that processor when we update the information.

This results in a significant decrease of the overhead in the fastpaths. It
also makes it easy to adopt the fast path for realtime kernels since this
is lockless and does not require the use of the current per cpu area
over the critical section. It is only important that the per cpu area is
current at the beginning of the critical section and at the end.

So there is no need even to disable preemption.

Test results show that the fastpath cycle count is reduced by up to ~ 40%
(alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
adds a few cycles.

Sadly this does nothing for the slowpath which is where the main issues with
performance in slub are but the best case performance rises significantly.
(For that see the more complex slub patches that require cmpxchg_double)

Kmalloc: alloc/free test

Before:

10000 times kmalloc(8)/kfree -> 134 cycles
10000 times kmalloc(16)/kfree -> 152 cycles
10000 times kmalloc(32)/kfree -> 144 cycles
10000 times kmalloc(64)/kfree -> 142 cycles
10000 times kmalloc(128)/kfree -> 142 cycles
10000 times kmalloc(256)/kfree -> 132 cycles
10000 times kmalloc(512)/kfree -> 132 cycles
10000 times kmalloc(1024)/kfree -> 135 cycles
10000 times kmalloc(2048)/kfree -> 135 cycles
10000 times kmalloc(4096)/kfree -> 135 cycles
10000 times kmalloc(8192)/kfree -> 144 cycles
10000 times kmalloc(16384)/kfree -> 754 cycles

After:

10000 times kmalloc(8)/kfree -> 78 cycles
10000 times kmalloc(16)/kfree -> 78 cycles
10000 times kmalloc(32)/kfree -> 82 cycles
10000 times kmalloc(64)/kfree -> 88 cycles
10000 times kmalloc(128)/kfree -> 79 cycles
10000 times kmalloc(256)/kfree -> 79 cycles
10000 times kmalloc(512)/kfree -> 85 cycles
10000 times kmalloc(1024)/kfree -> 82 cycles
10000 times kmalloc(2048)/kfree -> 82 cycles
10000 times kmalloc(4096)/kfree -> 85 cycles
10000 times kmalloc(8192)/kfree -> 82 cycles
10000 times kmalloc(16384)/kfree -> 706 cycles

Kmalloc: Repeatedly allocate then free test

Before:

10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles

After:

10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-11 17:42:49 +02:00
Christoph Lameter
d3f661d69a slub: Get rid of slab_free_hook_irq()
The following patch will make the fastpaths lockless and will no longer
require interrupts to be disabled. Calling the free hook with irq disabled
will no longer be possible.

Move the slab_free_hook_irq() logic into slab_free_hook. Only disable
interrupts if the features are selected that require callbacks with
interrupts off and reenable after calls have been made.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-11 17:42:49 +02:00
Jens Axboe
4c63f5646e Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core
Conflicts:
	block/blk-core.c
	block/blk-flush.c
	drivers/md/raid1.c
	drivers/md/raid10.c
	drivers/md/raid5.c
	fs/nilfs2/btnode.c
	fs/nilfs2/mdt.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:58:35 +01:00
Jens Axboe
721a9602e6 block: kill off REQ_UNPLUG
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:27 +01:00
Jens Axboe
5b417b1873 read-ahead: use plugging
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:26 +01:00
Jens Axboe
55602dd66f fs: make generic file read/write functions plug
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:26 +01:00
Jens Axboe
7eaceaccab block: remove per-queue plugging
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:07 +01:00
James Morris
fe3fa43039 Merge branch 'master' of git://git.infradead.org/users/eparis/selinux into next 2011-03-08 11:38:10 +11:00
Andi Kleen
5c4b4be3b6 mm: use correct numa policy node for transparent hugepages
Pass down the correct node for a transparent hugepage allocation.  Most
callers continue to use the current node, however the hugepaged daemon
now uses the previous node of the first to be collapsed page instead.
This ensures that khugepaged does not mess up local memory for an
existing process which uses local policy.

The choice of node is somewhat primitive currently: it just uses the
node of the first page in the pmd range.  An alternative would be to
look at multiple pages and use the most popular node.  I used the
simplest variant for now which should work well enough for the case of
all pages being on the same node.

[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-04 17:53:39 -08:00
Andi Kleen
19ee151e14 mm: preserve original node for transparent huge page copies
This makes a difference for LOCAL policy, where the node cannot be
determined from the policy itself, but has to be gotten from the original
page.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-04 17:53:39 -08:00
Andi Kleen
2f5f9486f8 mm: change alloc_pages_vma to pass down the policy node for local policy
Currently alloc_pages_vma() always uses the local node as policy node for
the LOCAL policy.  Pass this node down as an argument instead.

No behaviour change from this patch, but will be needed for followons.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-04 17:53:39 -08:00
Justin P. Mattock
ae0e47f02a Remove one to many n's in a word
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-03-01 15:47:58 +01:00
Mariusz Kozlowski
d71f606f68 slub: fix ksize() build error
mm/slub.c: In function 'ksize':
mm/slub.c:2728: error: implicit declaration of function 'slab_ksize'

slab_ksize() needs to go out of CONFIG_SLUB_DEBUG section.

Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-02-27 12:05:16 +02:00
Yinghai Lu
cc28989437 mm: Move early_node_map[] reverse scan helpers under HAVE_MEMBLOCK
Heiko found recent memblock change triggers these warnings on s390:

  mm/page_alloc.c:3623:22: warning: 'last_active_region_index_in_nid' defined but not used
  mm/page_alloc.c:3638:22: warning: 'previous_active_region_index_in_nid' defined but not used

Need to move those two function under HAVE_MEMBLOCK with its only
user, find_memory_core_early().

-tj: Minor updates to description.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-26 13:05:43 +01:00
Hugh Dickins
e5598f8bf5 memcg: more mem_cgroup_uncharge() batching
It seems odd that truncate_inode_pages_range(), called not only when
truncating but also when evicting inodes, has mem_cgroup_uncharge_start
and _end() batching in its second loop to clear up a few leftovers, but
not in its first loop that does almost all the work: add them there too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-25 15:07:37 -08:00
Andi Kleen
8eac563c1c thp: fix interleaving for transparent hugepages
The THP code didn't pass the correct interleaving shift to the memory
policy code.  Fix this here by adjusting for the order.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-25 15:07:37 -08:00
Namhyung Kim
29723fccc8 mm: fix dubious code in __count_immobile_pages()
When pfn_valid_within() failed 'iter' was incremented twice.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-25 15:07:37 -08:00
Mel Gorman
2876592f23 mm: vmscan: stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
should_continue_reclaim() for reclaim/compaction allows scanning to
continue even if pages are not being reclaimed until the full list is
scanned.  In terms of allocation success, this makes sense but potentially
it introduces unwanted latency for high-order allocations such as
transparent hugepages and network jumbo frames that would prefer to fail
the allocation attempt and fallback to order-0 pages.  Worse, there is a
potential that the full LRU scan will clear all the young bits, distort
page aging information and potentially push pages into swap that would
have otherwise remained resident.

This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered.  For allocations such as
hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
LRU list may still be scanned.

Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
is not set so the following avoids the gfp_mask being examined:

        if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
                return false;

A tool was developed based on ftrace that tracked the latency of
high-order allocations while transparent hugepage support was enabled and
three benchmarks were run.  The "fix-infinite" figures are 2.6.38-rc4 with
Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
applied.

  STREAM Highorder Allocation Latency Statistics
                 fix-infinite     break-early
  1 :: Count            10298           10229
  1 :: Min             0.4560          0.4640
  1 :: Mean            1.0589          1.0183
  1 :: Max            14.5990         11.7510
  1 :: Stddev          0.5208          0.4719
  2 :: Count                2               1
  2 :: Min             1.8610          3.7240
  2 :: Mean            3.4325          3.7240
  2 :: Max             5.0040          3.7240
  2 :: Stddev          1.5715          0.0000
  9 :: Count           111696          111694
  9 :: Min             0.5230          0.4110
  9 :: Mean           10.5831         10.5718
  9 :: Max            38.4480         43.2900
  9 :: Stddev          1.1147          1.1325

Mean time for order-1 allocations is reduced.  order-2 looks increased but
with so few allocations, it's not particularly significant.  THP mean
allocation latency is also reduced.  That said, allocation time varies so
significantly that the reductions are within noise.

Max allocation time is reduced by a significant amount for low-order
allocations but reduced for THP allocations which presumably are now
breaking before reclaim has done enough work.

  SysBench Highorder Allocation Latency Statistics
                 fix-infinite     break-early
  1 :: Count            15745           15677
  1 :: Min             0.4250          0.4550
  1 :: Mean            1.1023          1.0810
  1 :: Max            14.4590         10.8220
  1 :: Stddev          0.5117          0.5100
  2 :: Count                1               1
  2 :: Min             3.0040          2.1530
  2 :: Mean            3.0040          2.1530
  2 :: Max             3.0040          2.1530
  2 :: Stddev          0.0000          0.0000
  9 :: Count             2017            1931
  9 :: Min             0.4980          0.7480
  9 :: Mean           10.4717         10.3840
  9 :: Max            24.9460         26.2500
  9 :: Stddev          1.1726          1.1966

Again, mean time for order-1 allocations is reduced while order-2
allocations are too few to draw conclusions from.  The mean time for THP
allocations is also slightly reduced albeit the reductions are within
varianes.

Once again, our maximum allocation time is significantly reduced for
low-order allocations and slightly increased for THP allocations.

  Anon stream mmap reference Highorder Allocation Latency Statistics
  1 :: Count             1376            1790
  1 :: Min             0.4940          0.5010
  1 :: Mean            1.0289          0.9732
  1 :: Max             6.2670          4.2540
  1 :: Stddev          0.4142          0.2785
  2 :: Count                1               -
  2 :: Min             1.9060               -
  2 :: Mean            1.9060               -
  2 :: Max             1.9060               -
  2 :: Stddev          0.0000               -
  9 :: Count            11266           11257
  9 :: Min             0.4990          0.4940
  9 :: Mean        27250.4669      24256.1919
  9 :: Max      11439211.0000    6008885.0000
  9 :: Stddev     226427.4624     186298.1430

This benchmark creates one thread per CPU which references an amount of
anonymous memory 1.5 times the size of physical RAM.  This pounds swap
quite heavily and is intended to exercise THP a bit.

Mean allocation time for order-1 is reduced as before.  It's also reduced
for THP allocations but the variations here are pretty massive due to
swap.  As before, maximum allocation times are significantly reduced.

Overall, the patch reduces the mean and maximum allocation latencies for
the smaller high-order allocations.  This was with Slab configured so it
would be expected to be more significant with Slub which uses these size
allocations more aggressively.

The mean allocation times for THP allocations are also slightly reduced.
The maximum latency was slightly increased as predicted by the comments
due to reclaim/compaction breaking early.  However, workloads care more
about the latency of lower-order allocations than THP so it's an
acceptable trade-off.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-25 15:07:36 -08:00
Greg Thelen
a879bf582d mm: grab rcu read lock in move_pages()
The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
prevent free_pid() from reclaiming the pid.

Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
with:

  CONFIG_LOCKUP_DETECTOR=y
  CONFIG_PREEMPT=y
  CONFIG_LOCKDEP=y
  CONFIG_PROVE_LOCKING=y
  CONFIG_PROVE_RCU=y

Previously, migrate_pages() went through a similar transformation
replacing usage of tasklist_lock with rcu read lock:

  commit 55cfaa3cbd
  Author: Zeng Zhaoming <zengzm.kernel@gmail.com>
  Date:   Thu Dec 2 14:31:13 2010 -0800

      mm/mempolicy.c: add rcu read lock to protect pid structure

  commit 1e50df39f6
  Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
  Date:   Thu Jan 13 15:46:14 2011 -0800

      mempolicy: remove tasklist_lock from migrate_pages

Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Zeng Zhaoming <zengzm.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-25 15:07:36 -08:00
Miklos Szeredi
8074b26f67 mm: fix refcounting in swapon
Grab a reference to bdev before calling blkdev_get(), which expects
the refcount to be already incremented and either returns success or
decrements the refcount and returns an error.

The bug was introduced by e525fd89 (block: make blkdev_get/put()
handle exclusive access), which didn't take into account this behavior
of blkdev_get().

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-24 08:55:01 -08:00
Yinghai Lu
8bc1f91e1f bootmem: Move __alloc_memory_core_early() to nobootmem.c
Now that bootmem.c and nobootmem.c are separate, there's no reason to
define __alloc_memory_core_early(), which is used only by nobootmem,
inside #ifdef in page_alloc.c.  Move it to nobootmem.c and make it
static.

This patch doesn't introduce any behavior change.

-tj: Updated commit description.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-24 14:43:06 +01:00
Yinghai Lu
e782ab421b bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
Now that bootmem.c and nobootmem.c are separate, it's cleaner to
define contig_page_data in each file than in page_alloc.c with #ifdef.
Move it.

This patch doesn't introduce any behavior change.

-v2: According to Andrew, fixed the struct layout.
-tj: Updated commit description.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-24 14:43:06 +01:00
Yinghai Lu
0932587328 bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
mm/bootmem.c contained code paths for both bootmem and no bootmem
configurations.  They implement about the same set of APIs in
different ways and as a result bootmem.c contains massive amount of
#ifdef CONFIG_NO_BOOTMEM.

Separate out CONFIG_NO_BOOTMEM code into mm/nobootmem.c.  As the
common part is relatively small, duplicate them in nobootmem.c instead
of creating a common file or ifdef'ing in bootmem.c.

The followings are duplicated.

* {min|max}_low_pfn, max_pfn, saved_max_pfn
* free_bootmem_late()
* ___alloc_bootmem()
* __alloc_bootmem_low()

The followings are applicable only to nobootmem and moved verbatim.

* __free_pages_memory()
* free_all_memory_core_early()

The followings are not applicable to nobootmem and omitted in
nobootmem.c.

* reserve_bootmem_node()
* reserve_bootmem()

The rest split function bodies according to CONFIG_NO_BOOTMEM.

Makefile is updated so that only either bootmem.c or nobootmem.c is
built according to CONFIG_NO_BOOTMEM.

This patch doesn't introduce any behavior change.

-tj: Rewrote commit description.

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-24 14:43:05 +01:00
Hugh Dickins
a3e8cc643d mm: fix possible cause of a page_mapped BUG
Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
a hole with madvise(,, MADV_REMOVE).  That path is under mutex, and
cannot be explained by lack of serialization in unmap_mapping_range().

Reviewing the code, I found one place where vm_truncate_count handling
should have been updated, when I switched at the last minute from one
way of managing the restart_addr to another: mremap move changes the
virtual addresses, so it ought to adjust the restart_addr.

But rather than exporting the notion of restart_addr from memory.c, or
converting to restart_pgoff throughout, simply reset vm_truncate_count
to 0 to force a rescan if mremap move races with preempted truncation.

We have no confirmation that this fixes Robert's BUG,
but it is a fix that's worth making anyway.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-23 21:55:06 -08:00
Miklos Szeredi
2aa15890f3 mm: prevent concurrent unmap_mapping_range() on the same inode
Michael Leun reported that running parallel opens on a fuse filesystem
can trigger a "kernel BUG at mm/truncate.c:475"

Gurudas Pai reported the same bug on NFS.

The reason is, unmap_mapping_range() is not prepared for more than
one concurrent invocation per inode.  For example:

  thread1: going through a big range, stops in the middle of a vma and
     stores the restart address in vm_truncate_count.

  thread2: comes in with a small (e.g. single page) unmap request on
     the same vma, somewhere before restart_address, finds that the
     vma was already unmapped up to the restart address and happily
     returns without doing anything.

Another scenario would be two big unmap requests, both having to
restart the unmapping and each one setting vm_truncate_count to its
own value.  This could go on forever without any of them being able to
finish.

Truncate and hole punching already serialize with i_mutex.  Other
callers of unmap_mapping_range() do not, and it's difficult to get
i_mutex protection for all callers.  In particular ->d_revalidate(),
which calls invalidate_inode_pages2_range() in fuse, may be called
with or without i_mutex.

This patch adds a new mutex to 'struct address_space' to prevent
running multiple concurrent unmap_mapping_range() on the same mapping.

[ We'll hopefully get rid of all this with the upcoming mm
  preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
  lockbreak" patch in particular.  But that is for 2.6.39 ]

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Michael Leun <lkml20101129@newton.leun.net>
Reported-by: Gurudas Pai <gurudas.pai@oracle.com>
Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-23 19:52:52 -08:00
Eric Dumazet
b3d41885d9 slub: fix kmemcheck calls to match ksize() hints
Recent use of ksize() in network stack (commit ca44ac38 : net: don't
reallocate skb->head unless the current one hasn't the needed extra size
or is shared) triggers kmemcheck warnings, because ksize() can return
more space than kmemcheck is aware of.

Pekka Enberg noticed SLAB+kmemcheck is doing the right thing, while SLUB
+kmemcheck doesnt.

Bugzilla reference #27212

Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Suggested-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Changli Gao <xiaosuo@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-02-23 11:59:30 +02:00
Ryota Ozaki
a335b2e173 mm: Fix out-of-date comments which refers non-existent functions
do_file_page and do_no_page don't exist anymore, but some comments
still refers them. The patch fixes them by replacing them with
existing ones.

Signed-off-by: Ryota Ozaki <ozaki.ryota@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-02-17 16:54:39 +01:00
Andrea Arcangeli
a7d6e4ecdb thp: prevent hugepages during args/env copying into the user stack
Transparent hugepages can only be created if rmap is fully
functional. So we must prevent hugepages to be created while
is_vma_temporary_stack() is true.

This also optmizes away some harmless but unnecessary setting of
khugepaged_scan.address and it switches some BUG_ON to VM_BUG_ON.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-15 15:21:11 -08:00
Pekka Enberg
3ff84a7f36 Revert "slab: Fix missing DEBUG_SLAB last user"
This reverts commit 5c5e3b33b7.

The commit breaks ARM thusly:

| Mount-cache hash table entries: 512
| slab error in verify_redzone_free(): cache `idr_layer_cache': memory outside object was overwritten
| Backtrace:
| [<c0227088>] (dump_backtrace+0x0/0x110) from [<c0431afc>] (dump_stack+0x18/0x1c)
| [<c0431ae4>] (dump_stack+0x0/0x1c) from [<c0293304>] (__slab_error+0x28/0x30)
| [<c02932dc>] (__slab_error+0x0/0x30) from [<c0293a74>] (cache_free_debugcheck+0x1c0/0x2b8)
| [<c02938b4>] (cache_free_debugcheck+0x0/0x2b8) from [<c0293f78>] (kmem_cache_free+0x3c/0xc0)
| [<c0293f3c>] (kmem_cache_free+0x0/0xc0) from [<c032b1c8>] (ida_get_new_above+0x19c/0x1c0)
| [<c032b02c>] (ida_get_new_above+0x0/0x1c0) from [<c02af7ec>] (alloc_vfsmnt+0x54/0x144)
| [<c02af798>] (alloc_vfsmnt+0x0/0x144) from [<c0299830>] (vfs_kern_mount+0x30/0xec)
| [<c0299800>] (vfs_kern_mount+0x0/0xec) from [<c0299908>] (kern_mount_data+0x1c/0x20)
| [<c02998ec>] (kern_mount_data+0x0/0x20) from [<c02146c4>] (sysfs_init+0x68/0xc8)
| [<c021465c>] (sysfs_init+0x0/0xc8) from [<c02137d4>] (mnt_init+0x90/0x1b0)
| [<c0213744>] (mnt_init+0x0/0x1b0) from [<c0213388>] (vfs_caches_init+0x100/0x140)
| [<c0213288>] (vfs_caches_init+0x0/0x140) from [<c0208c0c>] (start_kernel+0x2e8/0x368)
| [<c0208924>] (start_kernel+0x0/0x368) from [<c0208034>] (__enable_mmu+0x0/0x2c)
| c0113268: redzone 1:0xd84156c5c032b3ac, redzone 2:0xd84156c5635688c0.
| slab error in cache_alloc_debugcheck_after(): cache `idr_layer_cache': double free, or memory outside object was overwritten
| ...
| c011307c: redzone 1:0x9f91102ffffffff, redzone 2:0x9f911029d74e35b
| slab: Internal list corruption detected in cache 'idr_layer_cache'(24), slabp c0113000(16). Hexdump:
|
| 000: 20 4f 10 c0 20 4f 10 c0 7c 00 00 00 7c 30 11 c0
| 010: 10 00 00 00 10 00 00 00 00 00 c9 17 fe ff ff ff
| 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
| 030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
| 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
| 050: fe ff ff ff fe ff ff ff fe ff ff ff 11 00 00 00
| 060: 12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00
| 070: 16 00 00 00 17 00 00 00 c0 88 56 63
| kernel BUG at /home/rmk/git/linux-2.6-rmk/mm/slab.c:2928!

Reference: https://lkml.org/lkml/2011/2/7/238
Cc: <stable@kernel.org> # 2.6.35.y and later
Reported-and-analyzed-by: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-02-14 17:46:21 +02:00
Ingo Molnar
d2137d5af4 Merge branch 'linus' into x86/bootmem
Conflicts:
	arch/x86/mm/numa_64.c

Merge reason: fix the conflict, update to latest -rc and pick up this
              dependent fix from Yinghai:

  e6d2e2b2b1: memblock: don't adjust size in memblock_find_base()

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-14 11:55:18 +01:00
KAMEZAWA Hiroyuki
678ff896a3 memcg: fix leak of accounting at failure path of hugepage collapsing
mem_cgroup_uncharge_page() should be called in all failure cases after
mem_cgroup_charge_newpage() is called in huge_memory.c::collapse_huge_page()

 [ 4209.076861] BUG: Bad page state in process khugepaged  pfn:1e9800
 [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping:          (null) index:0x2800
 [ 4209.078674] page flags: 0x40000000004000(head)
 [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
 [ 4209.082177] (/A)
 [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
 [ 4209.083412] Call Trace:
 [ 4209.083678]  [<ffffffff810f4454>] ? bad_page+0xe4/0x140
 [ 4209.084240]  [<ffffffff810f53e6>] ? free_pages_prepare+0xd6/0x120
 [ 4209.084837]  [<ffffffff8155621d>] ? rwsem_down_failed_common+0xbd/0x150
 [ 4209.085509]  [<ffffffff810f5462>] ? __free_pages_ok+0x32/0xe0
 [ 4209.086110]  [<ffffffff810f552b>] ? free_compound_page+0x1b/0x20
 [ 4209.086699]  [<ffffffff810fad6c>] ? __put_compound_page+0x1c/0x30
 [ 4209.087333]  [<ffffffff810fae1d>] ? put_compound_page+0x4d/0x200
 [ 4209.087935]  [<ffffffff810fb015>] ? put_page+0x45/0x50
 [ 4209.097361]  [<ffffffff8113f779>] ? khugepaged+0x9e9/0x1430
 [ 4209.098364]  [<ffffffff8107c870>] ? autoremove_wake_function+0x0/0x40
 [ 4209.099121]  [<ffffffff8113ed90>] ? khugepaged+0x0/0x1430
 [ 4209.099780]  [<ffffffff8107c236>] ? kthread+0x96/0xa0
 [ 4209.100452]  [<ffffffff8100dda4>] ? kernel_thread_helper+0x4/0x10
 [ 4209.101214]  [<ffffffff8107c1a0>] ? kthread+0x0/0xa0
 [ 4209.101842]  [<ffffffff8100dda0>] ? kernel_thread_helper+0x0/0x10

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-11 16:12:20 -08:00
Johannes Weiner
f0fdc5e8e6 vmscan: fix zone shrinking exit when scan work is done
Commit 3e7d344970 ("mm: vmscan: reclaim order-0 and use compaction
instead of lumpy reclaim") introduced an indefinite loop in
shrink_zone().

It meant to break out of this loop when no pages had been reclaimed and
not a single page was even scanned.  The way it would detect the latter
is by taking a snapshot of sc->nr_scanned at the beginning of the
function and comparing it against the new sc->nr_scanned after the scan
loop.  But it would re-iterate without updating that snapshot, looping
forever if sc->nr_scanned changed at least once since shrink_zone() was
invoked.

This is not the sole condition that would exit that loop, but it
requires other processes to change the zone state, as the reclaimer that
is stuck obviously can not anymore.

This is only happening for higher-order allocations, where reclaim is
run back to back with compaction.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Kent Overstreet<kent.overstreet@gmail.com>
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-11 16:12:20 -08:00
Michel Lespinasse
419d8c96db mlock: do not munlock pages in __do_fault()
If the page is going to be written to, __do_page needs to break COW.

However, the old page (before breaking COW) was never mapped mapped into
the current pte (__do_fault is only called when the pte is not present),
so vmscan can't have marked the old page as PageMlocked due to being
mapped in __do_fault's VMA.  Therefore, __do_fault() does not need to
worry about clearing PageMlocked() on the old page.

Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-11 16:12:20 -08:00
Michel Lespinasse
e15f8c01af mlock: fix race when munlocking pages in do_wp_page()
vmscan can lazily find pages that are mapped within VM_LOCKED vmas, and
set the PageMlocked bit on these pages, transfering them onto the
unevictable list.  When do_wp_page() breaks COW within a VM_LOCKED vma,
it may need to clear PageMlocked on the old page and set it on the new
page instead.

This change fixes an issue where do_wp_page() was clearing PageMlocked
on the old page while the pte was still pointing to it (as well as
rmap).  Therefore, we were not protected against vmscan immediately
transfering the old page back onto the unevictable list.  This could
cause pages to get stranded there forever.

I propose to move the corresponding code to the end of do_wp_page(),
after the pte (and rmap) have been pointed to the new page.
Additionally, we can use munlock_vma_page() instead of
clear_page_mlock(), so that the old page stays mlocked if there are
still other VM_LOCKED vmas mapping it.

Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-11 16:12:20 -08:00
Yinghai Lu
e6d2e2b2b1 memblock: don't adjust size in memblock_find_base()
While applying patch to use memblock to find aperture for 64bit x86.
Ingo found system with 1g + force_iommu

> No AGP bridge found
> Node 0: aperture @ 38000000 size 32 MB
> Aperture pointing to e820 RAM. Ignoring.
> Your BIOS doesn't leave a aperture memory hole
> Please enable the IOMMU option in the BIOS setup
> This costs you 64 MB of RAM
> Cannot allocate aperture memory hole (0,65536K)

the corresponding code:

	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
	if (addr == MEMBLOCK_ERROR || addr + aper_size > 0xffffffff) {
		printk(KERN_ERR
			"Cannot allocate aperture memory hole (%lx,%uK)\n",
				addr, aper_size>>10);
		return 0;
	}
	memblock_x86_reserve_range(addr, addr + aper_size, "aperture64")

fails because memblock core code align the size with 512M.  That could
make size way too big.

So don't align the size in that case.

actually __memblock_alloc_base, the another caller already align that
before calling that function.

BTW. x86 does not use __memblock_alloc_base...

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-11 16:12:20 -08:00
KAMEZAWA Hiroyuki
3751d60430 memcg: fix event counting breakage from recent THP update
Changes in e401f1761 ("memcg: modify accounting function for supporting
THP better") adds nr_pages to support multiple page size in
memory_cgroup_charge_statistics.

But counting the number of event nees abs(nr_pages) for increasing
counters.  This patch fixes event counting.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Johannes Weiner
8493ae439f memcg: never OOM when charging huge pages
Huge page coverage should obviously have less priority than the continued
execution of a process.

Never kill a process when charging it a huge page fails.  Instead, give up
after the first failed reclaim attempt and fall back to regular pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Johannes Weiner
19942822df memcg: prevent endless loop when charging huge pages to near-limit group
If reclaim after a failed charging was unsuccessful, the limits are
checked again, just in case they settled by means of other tasks.

This is all fine as long as every charge is of size PAGE_SIZE, because in
that case, being below the limit means having at least PAGE_SIZE bytes
available.

But with transparent huge pages, we may end up in an endless loop where
charging and reclaim fail, but we keep going because the limits are not
yet exceeded, although not allowing for a huge page.

Fix this up by explicitely checking for enough room, not just whether we
are within limits.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Johannes Weiner
9221edb712 memcg: prevent endless loop when charging huge pages
The charging code can encounter a charge size that is bigger than a
regular page in two situations: one is a batched charge to fill the
per-cpu stocks, the other is a huge page charge.

This code is distributed over two functions, however, and only the outer
one is aware of huge pages.  In case the charging fails, the inner
function will tell the outer function to retry if the charge size is
bigger than regular pages--assuming batched charging is the only case.
And the outer function will retry forever charging a huge page.

This patch makes sure the inner function can distinguish between batch
charging and a single huge page charge.  It will only signal another
attempt if batch charging failed, and go into regular reclaim when it is
called on behalf of a huge page.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Jin Dongming
af241a0834 thp: fix unsuitable behavior for hwpoisoned tail page
When a tail page of THP is poisoned, memory-failure will do nothing except
setting PG_hwpoison, while the expected behavior is that the process, who
is using the poisoned tail page, should be killed.

The above problem is caused by lru check of the poisoned tail page of THP.
Because PG_lru flag is only set on the head page of THP, the check always
consider the poisoned tail page as NON lru page.

So the lru check for the tail page of THP should be avoided, as like as
hugetlb.

This patch adds !PageTransCompound() before lru check for THP, because of
the check (!PageHuge() && !PageTransCompound()) the whole branch could be
optimized away at build time when both hugetlbfs and THP are set with "N"
(or in archs not supporting either of those).

[akpm@linux-foundation.org: fix unrelated typo in shake_page() comment]
Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Jin Dongming
a6d30dddae thp: fix the wrong reported address of hwpoisoned hugepages
When the tail page of THP is poisoned, the head page will be poisoned too.
 And the wrong address, address of head page, will be sent with sigbus
always.

So when the poisoned page is used by Guest OS which is running on KVM,
after the address changing(hva->gpa) by qemu, the unexpected process on
Guest OS will be killed by sigbus.

What we expected is that the process using the poisoned tail page could be
killed on Guest OS, but not that the process using the healthy head page
is killed.

Since it is not good to poison the healthy page, avoid poisoning other
than the page which is really poisoned.
  (While we poison all pages in a huge page in case of hugetlb,
   we can do this for THP thanks to split_huge_page().)

Here we fix two parts:
  1. Isolate the poisoned page only to make sure
     the reported address is the address of poisoned page.
  2. make the poisoned page work as the poisoned regular page.

[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Jin Dongming
efeda7a41e thp: fix splitting of hwpoisoned hugepages
The poisoned THP is now split with split_huge_page() in
collect_procs_anon().  If kmalloc() is failed in collect_procs(),
split_huge_page() could not be called.  And the work after
split_huge_page() for collecting the processes using poisoned page will
not be done, too.  So the processes using the poisoned page could not be
killed.

The condition becomes worse when CONFIG_DEBUG_VM == "Y".  Because the
poisoned THP could not be split, system panic will be caused by
VM_BUG_ON(PageTransHuge(page)) in try_to_unmap().

This patch does:
  1. move split_huge_page() to the place before collect_procs().
     This can be sure the failure of splitting THP is caused by itself.
  2. when splitting THP is failed, stop the operations after it.
     This can avoid unexpected system panic or non sense works.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Minchan Kim
48db54ee2f mm/migration: fix page corruption during hugepage migration
If migrate_huge_page by memory-failure fails , it calls put_page in itself
to decrease page reference and caller of migrate_huge_page also calls
putback_lru_pages.  It can do double free of page so it can make page
corruption on page holder.

In addtion, clean of pages on caller is consistent behavior with
migrate_pages by cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED
counting").

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:18 -08:00
Andrea Arcangeli
57fc4a5ee3 mm: when migrate_pages returns 0, all pages must have been released
In some cases migrate_pages could return zero while still leaving a few
pages in the pagelist (and some caller wouldn't notice it has to call
putback_lru_pages after commit cf608ac19c ("mm: compaction: fix
COMPACTPAGEFAILED counting")).

Add one missing putback_lru_pages not added by commit cf608ac19c ("mm:
compaction: fix COMPACTPAGEFAILED counting").

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:18 -08:00
Michal Hocko
552b372ba9 memsw: deprecate noswapaccount kernel parameter and schedule it for removal
noswapaccount couldn't be used to control memsw for both on/off cases so
we have added swapaccount[=0|1] parameter.  This way we can turn the
feature in two ways noswapaccount resp.  swapaccount=0.  We have kept the
original noswapaccount but I think we should remove it after some time as
it just makes more command line parameters without any advantages and also
the code to handle parameters is uglier if we want both parameters.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Requested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:18 -08:00
Michal Hocko
fceda1bf49 memsw: handle swapaccount kernel parameter correctly
__setup based kernel command line parameters handlers which are handled in
obsolete_checksetup are provided with the parameter value including =
(more precisely everything right after the parameter name).

This means that the current implementation of swapaccount[=1|0] doesn't
work at all because if there is a value for the parameter then we are
testing for "0" resp.  "1" but we are getting "=0" resp.  "=1" and if
there is no parameter value we are getting an empty string rather than
NULL.

The original noswapccount parameter, which doesn't care about the value,
works correctly.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:18 -08:00
Michel Lespinasse
fdf4c587a7 mlock: operate on any regions with protection != PROT_NONE
As Tao Ma noticed, change 5ecfda0 breaks blktrace. This is because
blktrace mmaps a file with PROT_WRITE permissions but without PROT_READ,
so my attempt to not unnecessarity break COW during mlock ended up
causing mlock to fail with a permission problem.

I am proposing to let mlock ignore vma protection in all cases except
PROT_NONE. In particular, mlock should not fail for PROT_WRITE regions
(as in the blktrace case, which broke at 5ecfda0) or for PROT_EXEC
regions (which seem to me like they were always broken).

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 10:20:50 +11:00
Eric Paris
2a7dba391e fs/vfs/security: pass last path component to LSM on inode creation
SELinux would like to implement a new labeling behavior of newly created
inodes.  We currently label new inodes based on the parent and the creating
process.  This new behavior would also take into account the name of the
new object when deciding the new label.  This is not the (supposed) full path,
just the last component of the path.

This is very useful because creating /etc/shadow is different than creating
/etc/passwd but the kernel hooks are unable to differentiate these
operations.  We currently require that userspace realize it is doing some
difficult operation like that and than userspace jumps through SELinux hoops
to get things set up correctly.  This patch does not implement new
behavior, that is obviously contained in a seperate SELinux patch, but it
does pass the needed name down to the correct LSM hook.  If no such name
exists it is fine to pass NULL.

Signed-off-by: Eric Paris <eparis@redhat.com>
2011-02-01 11:12:29 -05:00
Linus Torvalds
4fda116852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm:
  kmemleak: Allow kmemleak metadata allocations to fail
  kmemleak: remove memset by using kzalloc
2011-01-31 12:55:38 +10:00
Catalin Marinas
6ae4bd1f0b kmemleak: Allow kmemleak metadata allocations to fail
This patch adds __GFP_NORETRY and __GFP_NOMEMALLOC flags to the kmemleak
metadata allocations so that it has a smaller effect on the users of the
kernel slab allocator. Since kmemleak allocations can now fail more
often, this patch also reduces the verbosity by passing __GFP_NOWARN and
not dumping the stack trace when a kmemleak allocation fails.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ted Ts'o <tytso@mit.edu>
2011-01-27 18:32:06 +00:00
Jesper Juhl
0a08739e81 kmemleak: remove memset by using kzalloc
We don't need to memset if we just use kzalloc() rather than kmalloc() in
kmemleak_test_init().

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2011-01-27 18:31:51 +00:00
KAMEZAWA Hiroyuki
52dbb90509 memcg: fix race at move_parent around compound_order()
A fix up mem_cgroup_move_parent() which use compound_order() in
asynchronous manner.  This compound_order() may return unknown value
because we don't take lock.  Use PageTransHuge() and HPAGE_SIZE instead
of it.

Also clean up for mem_cgroup_move_parent().
 - remove unnecessary initialization of local variable.
 - rename charge_size -> page_size
 - remove unnecessary (wrong) comment.
 - added a comment about THP.

Note:
 Current design take compound_page_lock() in caller of move_account().
 This should be revisited when we implement direct move_task of hugepage
 without splitting.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:04 +10:00
KAMEZAWA Hiroyuki
3d37c4a919 memcg: bugfix check mem_cgroup_disabled() at split fixup
mem_cgroup_disabled() should be checked at splitting.  If disabled, no
heavy work is necesary.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:03 +10:00
KAMEZAWA Hiroyuki
01c88e2d6b memcg: fix account leak at failure of memsw acconting
Commit 4b53433468 ("memcg: clean up try_charge main loop") removes a
cancel of charge at case: memory charge-> success.  mem+swap charge->
failure.

This leaks usage of memory.  Fix it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: <stable@kernel.org>	[2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:03 +10:00
Minchan Kim
28bd65781c mm: migration: clarify migrate_pages() comment
Callers of migrate_pages should putback_lru_pages to return pages
isolated to LRU or free list.  Now comment is rather confusing.  It says
caller always have to call it.

It is more clear to point out that the caller has to call it if
migrate_pages's return value isn't zero.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:02 +10:00
Andrea Arcangeli
33a938774f mm: compaction: don't depend on HUGETLB_PAGE
Commit 5d6892407 ("thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE
enabled") causes this warning during the configuration process:

  warning: (TRANSPARENT_HUGEPAGE) selects COMPACTION which has unmet
  direct dependencies (EXPERIMENTAL && HUGETLB_PAGE && MMU)

COMPACTION doesn't depend on HUGETLB_PAGE, it doesn't depend on THP
either, it is also useful for regular alloc_pages(order > 0) including
the very kernel stack during fork (THREAD_ORDER = 1).  It's always
better to enable COMPACTION.

The warning should be an error because we would end up with MIGRATION
not selected, and COMPACTION wouldn't work without migration (despite it
seems to build with an inline migrate_pages returning -ENOSYS).

I'd also like to remove EXPERIMENTAL: compaction has been in the kernel
for some releases (for full safety the default remains disabled which I
think is enough).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Luca Tettamanti <kronos.it@gmail.com>
Tested-by: Luca Tettamanti <kronos.it@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:02 +10:00
Jesper Juhl
8dba474f03 mm/memcontrol.c: fix uninitialized variable use in mem_cgroup_move_parent()
In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps
to the 'put_back' label

  	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
  	if (ret || !parent)
  		goto put_back;

where we'll

  	if (charge > PAGE_SIZE)
  		compound_unlock_irqrestore(page, flags);

but, we have not assigned anything to 'flags' at this point, nor have we
called 'compound_lock_irqsave()' (which is what sets 'flags').  The
'put_back' label should be moved below the call to
compound_unlock_irqrestore() as per this patch.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:01 +10:00
David Rientjes
2ff754fa8f mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator
Commit 0e093d9976 ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") uncovered a livelock in the page
allocator that resulted in tasks infinitely looping trying to find
memory and kswapd running at 100% cpu.

The issue occurs because drain_all_pages() is called immediately
following direct reclaim when no memory is freed and try_to_free_pages()
returns non-zero because all zones in the zonelist do not have their
all_unreclaimable flag set.

When draining the per-cpu pagesets back to the buddy allocator for each
zone, the zone->pages_scanned counter is cleared to avoid erroneously
setting zone->all_unreclaimable later.  The problem is that no pages may
actually be drained and, thus, the unreclaimable logic never fails
direct reclaim so the oom killer may be invoked.

This apparently only manifested after wait_iff_congested() was
introduced and the zone was full of anonymous memory that would not
congest the backing store.  The page allocator would infinitely loop if
there were no other tasks waiting to be scheduled and clear
zone->pages_scanned because of drain_all_pages() as the result of this
change before kswapd could scan enough pages to trigger the reclaim
logic.  Additionally, with every loop of the page allocator and in the
reclaim path, kswapd would be kicked and would end up running at 100%
cpu.  In this scenario, current and kswapd are all running continuously
with kswapd incrementing zone->pages_scanned and current clearing it.

The problem is even more pronounced when current swaps some of its
memory to swap cache and the reclaimable logic then considers all active
anonymous memory in the all_unreclaimable logic, which requires a much
higher zone->pages_scanned value for try_to_free_pages() to return zero
that is never attainable in this scenario.

Before wait_iff_congested(), the page allocator would incur an
unconditional timeout and allow kswapd to elevate zone->pages_scanned to
a level that the oom killer would be called the next time it loops.

The fix is to only attempt to drain pcp pages if there is actually a
quantity to be drained.  The unconditional clearing of
zone->pages_scanned in free_pcppages_bulk() need not be changed since
other callers already ensure that draining will occur.  This patch
ensures that free_pcppages_bulk() will actually free memory before
calling into it from drain_all_pages() so zone->pages_scanned is only
cleared if appropriate.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:01 +10:00
David Rientjes
f33261d75b mm: fix deferred congestion timeout if preferred zone is not allowed
Before 0e093d9976 ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone"), preferred_zone was only used for NUMA
statistics, to determine the zoneidx from which to allocate from given
the type requested, and whether to utilize memory compaction.

wait_iff_congested(), though, uses preferred_zone to determine if the
congestion wait should be deferred because its dirty pages are backed by
a congested bdi.  This incorrectly defers the timeout and busy loops in
the page allocator with various cond_resched() calls if preferred_zone
is not allowed in the current context, usually consuming 100% of a cpu.

This patch ensures preferred_zone is an allowed zone in the fastpath
depending on whether current is constrained by its cpuset or nodes in
its mempolicy (when the nodemask passed is non-NULL).  This is correct
since the fastpath allocation always passes ALLOC_CPUSET when trying to
allocate memory.  In the slowpath, this patch resets preferred_zone to
the first zone of the allowed type when the allocation is not
constrained by current's cpuset, i.e.  it does not pass ALLOC_CPUSET.

This patch also ensures preferred_zone is from the set of allowed nodes
when called from within direct reclaim since allocations are always
constrained by cpusets in this context (it is blockable).

Both of these uses of cpuset_current_mems_allowed are protected by
get_mems_allowed().

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:00 +10:00
Andrew Morton
f95ba941d1 mm/pgtable-generic.c: fix CONFIG_SWAP=n build
mips (and sparc32):

  In file included from arch/mips/include/asm/tlb.h:21,
                   from mm/pgtable-generic.c:9:
  include/asm-generic/tlb.h: In function `tlb_flush_mmu':
  include/asm-generic/tlb.h:76: error: implicit declaration of function `release_pages'
  include/asm-generic/tlb.h: In function `tlb_remove_page':
  include/asm-generic/tlb.h:105: error: implicit declaration of function `page_cache_release'

free_pages_and_swap_cache() and free_page_and_swap_cache() are macros
which call release_pages() and page_cache_release().  The obvious fix is
to include pagemap.h in swap.h, where those macros are defined.  But that
breaks sparc for weird reasons.

So fix it within mm/pgtable-generic.c instead.

Reported-by: Yoichi Yuasa <yuasa@linux-mips.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Sergei Shtylyov <sshtylyov@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:49:58 +10:00
Christoph Lameter
63310467a3 mm: Remove support for kmem_cache_name()
The last user was ext4 and Eric Sandeen removed the call in a recent patch. See
the following URL for the discussion:

http://marc.info/?l=linux-ext4&m=129546975702198&w=2

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-01-23 21:00:05 +02:00
Johannes Weiner
713735b423 memcg: correctly order reading PCG_USED and pc->mem_cgroup
The placement of the read-side barrier is confused: the writer first
sets pc->mem_cgroup, then PCG_USED.  The read-side barrier has to be
between testing PCG_USED and reading pc->mem_cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:06 -08:00
Jan Kara
382e27daa5 mm: fix truncate_setsize() comment
Contrary to what the comment says, truncate_setsize() should be called
*before* filesystem truncated blocks.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:06 -08:00
KAMEZAWA Hiroyuki
987eba66e0 memcg: fix rmdir, force_empty with THP
Now, when THP is enabled, memcg's rmdir() function is broken because
move_account() for THP page is not supported.

This will cause account leak or -EBUSY issue at rmdir().
This patch fixes the issue by supporting move_account() THP pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:06 -08:00
KAMEZAWA Hiroyuki
ece35ca810 memcg: fix LRU accounting with THP
memory cgroup's LRU stat should take care of size of pages because
Transparent Hugepage inserts hugepage into LRU.  If this value is the
number wrong, memory reclaim will not work well.

Note: only head page of THP's huge page is linked into LRU.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:06 -08:00
KAMEZAWA Hiroyuki
ca3e021417 memcg: fix USED bit handling at uncharge in THP
Now, under THP:

at charge:
  - PageCgroupUsed bit is set to all page_cgroup on a hugepage.
    ....set to 512 pages.
at uncharge
  - PageCgroupUsed bit is unset on the head page.

So, some pages will remain with "Used" bit.

This patch fixes that Used bit is set only to the head page.
Used bits for tail pages will be set at splitting if necessary.

This patch adds this lock order:
   compound_lock() -> page_cgroup_move_lock().

[akpm@linux-foundation.org: fix warning]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:06 -08:00
KAMEZAWA Hiroyuki
e401f1761c memcg: modify accounting function for supporting THP better
mem_cgroup_charge_statisics() was designed for charging a page but now, we
have transparent hugepage.  To fix problems (in following patch) it's
required to change the function to get the number of pages as its
arguments.

The new function gets following as argument.
  - type of page rather than 'pc'
  - size of page which is accounted.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:05 -08:00
Johannes Weiner
82478fb7bc mm: compaction: prevent division-by-zero during user-requested compaction
Up until 3e7d344 ("mm: vmscan: reclaim order-0 and use compaction instead
of lumpy reclaim"), compaction skipped calculating the fragmentation index
of a zone when compaction was explicitely requested through the procfs
knob.

However, when compaction_suitable was introduced, it did not come with an
extra check for order == -1, set on explicit compaction requests, and
passed this order on to the fragmentation index calculation, where it
overshifts the number of requested pages, leading to a division by zero.

This patch makes sure that order == -1 is recognized as the flag it is
rather than passing it along as valid order parameter.

[akpm@linux-foundation.org: add comment, per Mel]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:05 -08:00
Jesper Juhl
3305de51bf mm/vmscan.c: remove duplicate include of compaction.h
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:05 -08:00
Tomi Valkeinen
abb65272a1 memblock: fix memblock_is_region_memory()
memblock_is_region_memory() uses reserved memblocks to search for the
given region, while it should use the memory memblocks.

I encountered the problem with OMAP's framebuffer ram allocation.
Normally the ram is allocated dynamically, and this function is not
called.  However, if we want to pass the framebuffer from the bootloader
to the kernel (to retain the boot image), this function is used to check
the validity of the kernel parameters for the framebuffer ram area.

Signed-off-by: Tomi Valkeinen <tomi.valkeinen@nokia.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:05 -08:00
Johannes Weiner
453c719261 thp: keep highpte mapped until it is no longer needed
Two users reported THP-related crashes on 32-bit x86 machines.  Their oops
reports indicated an invalid pte, and subsequent code inspection showed
that the highpte is actually used after unmap.

The fix is to unmap the pte only after all operations against it are
finished.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Reported-by: werner <w.landgraf@ru.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:05 -08:00
Linus Torvalds
83896fb5e5 Revert "mm: simplify code of swap.c"
This reverts commit d8505dee1a.

Chris Mason ended up chasing down some page allocation errors and pages
stuck waiting on the IO scheduler, and was able to narrow it down to two
commits: commit 744ed14427 ("mm: batch activate_page() to reduce lock
contention") and d8505dee1a ("mm: simplify code of swap.c").

This reverts the second one.

Reported-and-debugged-by: Chris Mason <chris.mason@oracle.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-17 14:42:34 -08:00
Linus Torvalds
7a608572a2 Revert "mm: batch activate_page() to reduce lock contention"
This reverts commit 744ed14427.

Chris Mason ended up chasing down some page allocation errors and pages
stuck waiting on the IO scheduler, and was able to narrow it down to two
commits: commit 744ed14427 ("mm: batch activate_page() to reduce lock
contention") and d8505dee1a ("mm: simplify code of swap.c").

This reverts the first of them.

Reported-and-debugged-by: Chris Mason <chris.mason@oracle.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-17 14:42:19 -08:00
Andrea Arcangeli
b3697c0255 fix non-x86 build failure in pmdp_get_and_clear
pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
BUG() and they were defined only to diminish the risk of build issues on
not-x86 archs and to be consistent with the generic pte methods previously
defined in include/asm-generic/pgtable.h.

But they are causing more trouble than they were supposed to solve, so
it's simpler not to define them when THP is off.

This is also correcting the export of pmdp_splitting_flush which is
currently unused (x86 isn't using the generic implementation in
mm/pgtable-generic.c and no other arch needs that [yet]).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Sam Ravnborg <sam@ravnborg.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-16 15:05:44 -08:00
H Hartley Sweeten
68a1b19559 mm/slab.c: make local symbols static
Local symbols should be static.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-01-15 13:28:36 +02:00
Pekka Enberg
597fb188cb Merge branch 'slub/hotplug' into slab/urgent 2011-01-15 13:28:17 +02:00
Linus Torvalds
52cfd503ad Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
  ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
  ACPI: fix resource check message
  ACPI / Battery: Update information on info notification and resume
  ACPI: Drop device flag wake_capable
  ACPI: Always check if _PRW is present before trying to evaluate it
  ACPI / PM: Check status of power resources under mutexes
  ACPI / PM: Rename acpi_power_off_device()
  ACPI / PM: Drop acpi_power_nocheck
  ACPI / PM: Drop acpi_bus_get_power()
  Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
  ACPI / Fan: Rework the handling of power resources
  ACPI / PM: Register power resource devices as soon as they are needed
  ACPI / PM: Register acpi_power_driver early
  ACPI / PM: Add function for updating device power state consistently
  ACPI / PM: Add function for device power state initialization
  ACPI / PM: Introduce __acpi_bus_get_power()
  ACPI / PM: Introduce function for refcounting device power resources
  ACPI / PM: Add functions for manipulating lists of power resources
  ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
  ACPICA: Update version to 20101209
  ...
2011-01-13 20:15:35 -08:00
Daisuke Nishimura
50de1dd967 memcg: fix memory migration of shmem swapcache
In the current implementation mem_cgroup_end_migration() decides whether
the page migration has succeeded or not by checking "oldpage->mapping".

But if we are tring to migrate a shmem swapcache, the page->mapping of it
is NULL from the begining, so the check would be invalid.  As a result,
mem_cgroup_end_migration() assumes the migration has succeeded even if
it's not, so "newpage" would be freed while it's not uncharged.

This patch fixes it by passing mem_cgroup_end_migration() the result of
the page migration.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:51 -08:00
Jesper Juhl
17295c88a1 memcg: use [kv]zalloc[_node] rather than [kv]malloc+memset
In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
followed by memset() to zero the memory.  This can be more efficiently
achieved by using kzalloc() and vzalloc().  There's also one situation
where we can use kzalloc_node() - this is what's new in this version of
the patch.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:51 -08:00
Daisuke Nishimura
dfe076b097 memcg: fix deadlock between cpuset and memcg
Commit b1dd693e ("memcg: avoid deadlock between move charge and
try_charge()") can cause another deadlock about mmap_sem on task migration
if cpuset and memcg are mounted onto the same mount point.

After the commit, cgroup_attach_task() has sequence like:

cgroup_attach_task()
  ss->can_attach()
    cpuset_can_attach()
    mem_cgroup_can_attach()
      down_read(&mmap_sem)        (1)
  ss->attach()
    cpuset_attach()
      mpol_rebind_mm()
        down_write(&mmap_sem)     (2)
        up_write(&mmap_sem)
      cpuset_migrate_mm()
        do_migrate_pages()
          down_read(&mmap_sem)
          up_read(&mmap_sem)
    mem_cgroup_move_task()
      mem_cgroup_clear_mc()
        up_read(&mmap_sem)

We can cause deadlock at (2) because we've already aquire the mmap_sem at (1).

But the commit itself is necessary to fix deadlocks which have existed
before the commit like:

Ex.1)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |  down_write(&mmap_sem)
      mc.moving_task = current          |    ..
      mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
        mem_cgroup_count_precharge()    |    prepare_to_wait()
          down_read(&mmap_sem)          |    if (mc.moving_task)
          -> cannot aquire the lock     |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

Ex.2)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |
      mc.moving_task = current          |
      mem_cgroup_precharge_mc()         |
        mem_cgroup_count_precharge()    |
          down_read(&mmap_sem)          |
          ..                            |
          up_read(&mmap_sem)            |
                                        |  down_write(&mmap_sem)
    mem_cgroup_move_task()              |    ..
      mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
        down_read(&mmap_sem)            |    prepare_to_wait()
        -> cannot aquire the lock       |    if (mc.moving_task)
                                        |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

This patch fixes all of these problems by:
1. revert the commit.
2. To fix the Ex.1, we set mc.moving_task after mem_cgroup_count_precharge()
   has released the mmap_sem.
3. To fix the Ex.2, we use down_read_trylock() instead of down_read() in
   mem_cgroup_move_charge() and, if it has failed to aquire the lock, cancel
   all extra charges, wake up all waiters, and retry trylock.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:51 -08:00
Minchan Kim
043d18b1e5 memcg: remove unnecessary return from void-returning mem_cgroup_del_lru_list()
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:50 -08:00
Johannes Weiner
f3e8eb70b1 memcg: fix unit mismatch in memcg oom limit calculation
Adding the number of swap pages to the byte limit of a memory control
group makes no sense.  Convert the pages to bytes before adding them.

The only user of this code is the OOM killer, and the way it is used means
that the error results in a higher OOM badness value.  Since the cgroup
limit is the same for all tasks in the cgroup, the error should have no
practical impact at the moment.

But let's not wait for future or changing users to trip over it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:50 -08:00
KAMEZAWA Hiroyuki
dbd4ea78f0 memcg: add lock to synchronize page accounting and migration
Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
accounting and migration code.  This reworks the locking scheme of
_update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK,
which is always taken under IRQ disable.

1. If pages are being migrated from a memcg, then updates to that
   memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
   move_lock_page_cgroup().  In an upcoming commit, memcg dirty page
   accounting will be updating memcg page accounting (specifically: num
   writeback pages) from IRQ context (softirq).  Avoid a deadlocking
   nested spin lock attempt by disabling irq on the local processor when
   grabbing the PCG_MOVE_LOCK.

2. lock for update_page_stat is used only for avoiding race with
   move_account().  So, IRQ awareness of lock_page_cgroup() itself is not
   a problem.  The problem is between mem_cgroup_update_page_stat() and
   mem_cgroup_move_account_page().

Trade-off:
  * Changing lock_page_cgroup() to always disable IRQ (or
    local_bh) has some impacts on performance and I think
    it's bad to disable IRQ when it's not necessary.
  * adding a new lock makes move_account() slower.  Score is
    here.

Performance Impact: moving a 8G anon process.

Before:
	real    0m0.792s
	user    0m0.000s
	sys     0m0.780s

After:
	real    0m0.854s
	user    0m0.000s
	sys     0m0.842s

This score is bad but planned patches for optimization can reduce
this impact.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Andrea Righi <arighi@develer.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:50 -08:00
Greg Thelen
2a7106f2cb memcg: create extensible page stat update routines
Replace usage of the mem_cgroup_update_file_mapped() memcg
statistic update routine with two new routines:
* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()

As before, only the file_mapped statistic is managed.  However, these more
general interfaces allow for new statistics to be more easily added.  New
statistics are added with memcg dirty page accounting.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:50 -08:00
Shaohua Li
744ed14427 mm: batch activate_page() to reduce lock contention
The zone->lru_lock is heavily contented in workload where activate_page()
is frequently used.  We could do batch activate_page() to reduce the lock
contention.  The batched pages will be added into zone list when the pool
is full or page reclaim is trying to drain them.

For example, in a 4 socket 64 CPU system, create a sparse file and 64
processes, processes shared map to the file.  Each process read access the
whole file and then exit.  The process exit will do unmap_vmas() and cause
a lot of activate_page() call.  In such workload, we saw about 58% total
time reduction with below patch.  Other workloads with a lot of
activate_page also benefits a lot too.

I tested some microbenchmarks:
case-anon-cow-rand-mt		0.58%
case-anon-cow-rand		-3.30%
case-anon-cow-seq-mt		-0.51%
case-anon-cow-seq		-5.68%
case-anon-r-rand-mt		0.23%
case-anon-r-rand		0.81%
case-anon-r-seq-mt		-0.71%
case-anon-r-seq			-1.99%
case-anon-rx-rand-mt		2.11%
case-anon-rx-seq-mt		3.46%
case-anon-w-rand-mt		-0.03%
case-anon-w-rand		-0.50%
case-anon-w-seq-mt		-1.08%
case-anon-w-seq			-0.12%
case-anon-wx-rand-mt		-5.02%
case-anon-wx-seq-mt		-1.43%
case-fork			1.65%
case-fork-sleep			-0.07%
case-fork-withmem		1.39%
case-hugetlb			-0.59%
case-lru-file-mmap-read-mt	-0.54%
case-lru-file-mmap-read		0.61%
case-lru-file-mmap-read-rand	-2.24%
case-lru-file-readonce		-0.64%
case-lru-file-readtwice		-11.69%
case-lru-memcg			-1.35%
case-mmap-pread-rand-mt		1.88%
case-mmap-pread-rand		-15.26%
case-mmap-pread-seq-mt		0.89%
case-mmap-pread-seq		-69.72%
case-mmap-xread-rand-mt		0.71%
case-mmap-xread-seq-mt		0.38%

The most significent are:
case-lru-file-readtwice		-11.69%
case-mmap-pread-rand		-15.26%
case-mmap-pread-seq		-69.72%

which use activate_page a lot.  others are basically variations because
each run has slightly difference.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:50 -08:00
Shaohua Li
d8505dee1a mm: simplify code of swap.c
Clean up code and remove duplicate code.  Next patch will use
pagevec_lru_move_fn introduced here too.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:50 -08:00
Andrew Morton
c06b1fca18 mm/page_alloc.c: don't cache `current' in a local
It's old-fashioned and unneeded.

akpm:/usr/src/25> size mm/page_alloc.o
   text    data     bss     dec     hex filename
  39884 1241317   18808 1300009  13d629 mm/page_alloc.o (before)
  39838 1241317   18808 1299963  13d5fb mm/page_alloc.o (after)

Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:49 -08:00
Hugh Dickins
fd4a4663db mm: fix hugepage migration
2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
but its anon_vma handling was still based around the 2.6.35 conventions.
Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
drop_anon_vma in the same way as we're now changing unmap_and_move().

I don't particularly like to propose this for stable when I've not seen
its problems in practice nor tested the solution: but it's clearly out of
synch at present.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@kernel.org> [2.6.37, 2.6.36]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:49 -08:00
Hugh Dickins
1ce82b69e9 mm: fix migration hangs on anon_vma lock
Increased usage of page migration in mmotm reveals that the anon_vma
locking in unmap_and_move() has been deficient since 2.6.36 (or even
earlier).  Review at the time of f18194275c
("mm: fix hang on anon_vma->root->lock") missed the issue here: the
anon_vma to which we get a reference may already have been freed back to
its slab (it is in use when we check page_mapped, but that can change),
and so its anon_vma->root may be switched at any moment by reuse in
anon_vma_prepare.

Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
then we don't need any rcu read locking over here.

In removing the rcu_unlock label: since PageAnon is a bit in
page->mapping, it's impossible for a !page->mapping page to be anon; but
insert VM_BUG_ON in case the implementation ever changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@kernel.org> [2.6.37, 2.6.36]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:49 -08:00
Hugh Dickins
2919bfd075 ksm: drain pagevecs to lru
It was hard to explain the page counts which were causing new LTP tests
of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc:Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:49 -08:00
Eric B Munson
73ae31e598 hugetlb: fix handling of parse errors in sysfs
When parsing changes to the huge page pool sizes made from userspace via
the sysfs interface, bogus input values are being covered up by
nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0
when strict_strtoul returns an error.  This can cause an infinite loop in
the nr_hugepages_store code.  This patch changes the return value for
these functions to -EINVAL when strict_strtoul returns an error.

Signed-off-by: Eric B Munson <emunson@mgebm.net>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:49 -08:00
Eric B Munson
adbe8726dc hugetlb: do not allow pagesize >= MAX_ORDER pool adjustment
Huge pages with order >= MAX_ORDER must be allocated at boot via the
kernel command line, they cannot be allocated or freed once the kernel is
up and running.  Currently we allow values to be written to the sysfs and
sysctl files controling pool size for these huge page sizes.  This patch
makes the store functions for nr_hugepages and nr_overcommit_hugepages
return -EINVAL when the pool for a page size >= MAX_ORDER is changed.

[akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
[caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:49 -08:00
Michal Hocko
08d4a24659 hugetlb: check the return value of string conversion in sysctl handler
proc_doulongvec_minmax may fail if the given buffer doesn't represent a
valid number.  If we provide something invalid we will initialize the
resulting value (nr_overcommit_huge_pages in this case) to a random value
from the stack.

The issue was introduced by a3d0c6aa when the default handler has been
replaced by the helper function where we do not check the return value.

Reproducer:
echo "" > /proc/sys/vm/nr_overcommit_hugepages

[akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:49 -08:00
Andrew Morton
684265d4a3 mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc()
As it stands this code will degenerate into a busy-wait if the calling task
has signal_pending().

Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Rolf Eike Beer
84bc227d7f mm/dmapool.c: take lock only once in dma_pool_free()
dma_pool_free() scans for the page to free in the pool list holding the
pool lock.  Then it releases the lock basically to acquire it immediately
again.  Modify the code to only take the lock once.

This will do some additional loops and computations with the lock held in
if memory debugging is activated.  If it is not activated the only new
operations with this lock is one if and one substraction.

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
KyongHo Cho
43506fad21 mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists
The previous approach of calucation of combined index was

	page_idx & ~(1 << order))

but we have same result with

	page_idx & buddy_idx

This reduces instructions slightly as well as enhances readability.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix used-unintialised warning]
Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Jiri Kosina
5520e89485 brk: fix min_brk lower bound computation for COMPAT_BRK
Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still
be overriden by randomize_va_space sysctl.

If this is the case, the min_brk computation in sys_brk() implementation
is wrong, as it solely takes into account COMPAT_BRK setting, assuming
that brk start is not randomized.  But that might not be the case if
randomize_va_space sysctl has been set to '2' at the time the binary has
been loaded from disk.

In such case, the check has to be done in a same way as in
!CONFIG_COMPAT_BRK case.

In addition to that, the check for the COMPAT_BRK case introduced back in
a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
bound") is slightly wrong -- the lower bound shouldn't be mm->end_code,
but mm->end_data instead, as that's where the legacy applications expect
brk section to start (i.e.  immediately after last global variable).

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Jesper Juhl
32d6feadf4 mm/hugetlb.c: fix error-path memory leak in nr_hugepages_store_common()
The NODEMASK_ALLOC macro may dynamically allocate memory for its second
argument ('nodes_allowed' in this context).

In nr_hugepages_store_common() we may abort early if strict_strtoul()
fails, but in that case we do not free the memory already allocated to
'nodes_allowed', causing a memory leak.

This patch closes the leak by freeing the memory in the error path.

[akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim]
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Mel Gorman
29c1f677d4 mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot during file page migration
migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
anonymous pages, as introduced by git commit
989f89c57e ("fix rcu_read_lock() in page
migraton").  The point of the RCU protection there is part of getting a
stable reference to anon_vma and is only held for anon pages as file pages
are locked which is sufficient protection against freeing.

However, while a file page's mapping is being migrated, the radix tree is
double checked to ensure it is the expected page.  This uses
radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
triggering the following warning.

[  173.674290] ===================================================
[  173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
[  173.676016] ---------------------------------------------------
[  173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
[  173.676016]
[  173.676016] other info that might help us debug this:
[  173.676016]
[  173.676016]
[  173.676016] rcu_scheduler_active = 1, debug_locks = 0
[  173.676016] 1 lock held by hugeadm/2899:
[  173.676016]  #0:  (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
[  173.676016]
[  173.676016] stack backtrace:
[  173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
[  173.676016] Call Trace:
[  173.676016]  [<c128cc01>] ? printk+0x14/0x1b
[  173.676016]  [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
[  173.676016]  [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
[  173.676016]  [<c10e41ad>] migrate_page+0x23/0x39
[  173.676016]  [<c10e491b>] buffer_migrate_page+0x22/0x107
[  173.676016]  [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
[  173.676016]  [<c10e425d>] move_to_new_page+0x9a/0x1ae
[  173.676016]  [<c10e47e6>] migrate_pages+0x1e7/0x2fa

This patch introduces radix_tree_deref_slot_protected() which calls
rcu_dereference_protected().  Users of it must pass in the
mapping->tree_lock that is protecting this dereference.  Holding the tree
lock protects against parallel updaters of the radix tree meaning that
rcu_dereference_protected is allowable.

[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Milton Miller <miltonm@bga.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>		[2.6.37.early]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Andrea Arcangeli
22e5c47ee2 thp: add compound_trans_head() helper
Cleanup some code with common compound_trans_head helper.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Andrea Arcangeli
29ad768cfc thp: KSM on THP
This makes KSM full operational with THP pages.  Subpages are scanned
while the hugepage is still in place and delivering max cpu performance,
and only if there's a match and we're going to deduplicate memory, the
single hugepages with the subpage match is split.

There will be no false sharing between ksmd and khugepaged.  khugepaged
won't collapse 2m virtual regions with KSM pages inside.  ksmd also should
only split pages when the checksum matches and we're likely to split an
hugepage for some long living ksm page (usual ksm heuristic to avoid
sharing pages that get de-cowed).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Andrea Arcangeli
60ab3244ec thp: khugepaged: make khugepaged aware about madvise
MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
mmap and before touching the memory.  While this is enough for most
usages, it's little effort to make madvise more dynamic at runtime on an
existing mapping by making khugepaged aware about madvise.

MADV_HUGEPAGE: register in khugepaged immediately without waiting a page
fault (that may not ever happen if all pages are already mapped and the
"enabled" knob was set to madvise during the initial page faults).

MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
collapsing pages where not needed.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:47 -08:00
Andrea Arcangeli
a664b2d855 thp: madvise(MADV_NOHUGEPAGE)
Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
hugepage backed.  Return -EINVAL if the vma is not of an anonymous type,
or the feature isn't built into the kernel.  Never silently return
success.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:47 -08:00
Andrea Arcangeli
37c2ac7872 thp: compound_trans_order
Read compound_trans_order safe. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:47 -08:00
Andrea Arcangeli
91600e9e59 thp: fix memory-failure hugetlbfs vs THP collision
hugetlbfs was changed to allow memory failure to migrate the hugetlbfs
pages and that broke THP as split_huge_page was then called on hugetlbfs
pages too.

compound_head/order was also run unsafe on THP pages that can be splitted
at any time.

All compound_head() invocations in memory-failure.c that are run on pages
that aren't pinned and that can be freed and reused from under us (while
compound_head is running) are buggy because compound_head can return a
dangling pointer, but I'm not fixing this as this is a generic
memory-failure bug not specific to THP but it applies to hugetlbfs too, so
I can fix it later after THP is merged upstream.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:47 -08:00
Andrea Arcangeli
14d1a55cd2 thp: add debug checks for mapcount related invariants
Add debug checks for invariants that if broken could lead to mapcount vs
page_mapcount debug checks to trigger later in split_huge_page.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:47 -08:00
Rik van Riel
9992af1029 thp: scale nr_rotated to balance memory pressure
Make sure we scale up nr_rotated when we encounter a referenced
transparent huge page.  This ensures pageout scanning balance is not
distorted when there are huge pages on the LRU.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:46 -08:00
Rik van Riel
2c888cfbc1 thp: fix anon memory statistics with transparent hugepages
Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
statistics, so the Active(anon) and Inactive(anon) statistics in
/proc/meminfo are correct.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:46 -08:00
Rik van Riel
97562cd243 thp: disable transparent hugepages by default on small systems
On small systems, the extra memory used by the anti-fragmentation memory
reserve and simply because huge pages are smaller than large pages can
easily outweigh the benefits of less TLB misses.

A less obvious concern is if run on a NUMA machine with asymmetric node
sizes and one of them is very small.  The reserve could make the node
unusable.

In case of the crashdump kernel, OOMs have been observed due to the
anti-fragmentation memory reserve taking up a large fraction of the
crashdump image.

This patch disables transparent hugepages on systems with less than 1GB of
RAM, but the hugepage subsystem is fully initialized so administrators can
enable THP through /sys if desired.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Avi Kiviti <avi@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:46 -08:00
Andrea Arcangeli
c5a73c3d55 thp: use compaction for all allocation orders
It makes no sense not to enable compaction for small order pages as we
don't want to end up with bad order 2 allocations and good and graceful
order 9 allocations.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:46 -08:00
Andrea Arcangeli
5a03b051ed thp: use compaction in kswapd for GFP_ATOMIC order > 0
This takes advantage of memory compaction to properly generate pages of
order > 0 if regular page reclaim fails and priority level becomes more
severe and we don't reach the proper watermarks.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:46 -08:00
Andrea Arcangeli
878aee7d6b thp: freeze khugepaged and ksmd
It's unclear why schedule friendly kernel threads can't be taken away by
the CPU through the scheduler itself.  It's safer to stop them as they can
trigger memory allocation, if kswapd also freezes itself to avoid
generating I/O they have too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:46 -08:00
Andrea Arcangeli
8ee53820ed thp: mmu_notifier_test_young
For GRU and EPT, we need gup-fast to set referenced bit too (this is why
it's correct to return 0 when shadow_access_mask is zero, it requires
gup-fast to set the referenced bit).  qemu-kvm access already sets the
young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow
paging EPT minor fault we relay on gup-fast to signal the page is in
use...

We also need to check the young bits on the secondary pagetables for NPT
and not nested shadow mmu as the data may never get accessed again by the
primary pte.

Without this closer accuracy, we'd have to remove the heuristic that
avoids collapsing hugepages in hugepage virtual regions that have not even
a single subpage in use.

->test_young is full backwards compatible with GRU and other usages that
don't have young bits in pagetables set by the hardware and that should
nuke the secondary mmu mappings when ->clear_flush_young runs just like
EPT does.

Removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page completely isn't so bad either probably but
I thought it was worth it and this makes it reliable.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:46 -08:00
Andrea Arcangeli
4b7167b9ff thp: don't allow transparent hugepage support without PSE
Archs implementing Transparent Hugepage Support must implement a function
called has_transparent_hugepage to be sure the virtual or physical CPU
supports Transparent Hugepages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:45 -08:00
Andrea Arcangeli
94fcc585fb thp: avoid breaking huge pmd invariants in case of vma_adjust failures
An huge pmd can only be mapped if the corresponding 2M virtual range is
fully contained in the vma.  At times the VM calls split_vma twice, if the
first split_vma succeeds and the second fail, the first split_vma remains
in effect and it's not rolled back.  For split_vma or vma_adjust to fail
an allocation failure is needed so it's a very unlikely event (the out of
memory killer would normally fire before any allocation failure is visible
to kernel and userland and if an out of memory condition happens it's
unlikely to happen exactly here).  Nevertheless it's safer to ensure that
no huge pmd can be left around if the vma is adjusted in a way that can't
fit hugepages anymore at the new vm_start/vm_end address.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:45 -08:00
Andrea Arcangeli
bc835011af thp: transhuge isolate_migratepages()
It's not worth migrating transparent hugepages during compaction.  Those
hugepages don't create fragmentation.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:45 -08:00
Andrea Arcangeli
5d6892407c thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
With transparent hugepage support we need compaction for the "defrag"
sysfs controls to be effective.

At the moment THP hangs the system if COMPACTION isn't selected, as
without COMPACTION lumpy reclaim wouldn't be entirely disabled.  So at the
moment it's not orthogonal.  When lumpy will be removed from the VM I can
remove the select COMPACTION in theory, but then 99% of THP users would be
still doing a mistake in disabling compaction, even if the mistake won't
return in fatal runtime but just slightly degraded performance.  So from a
theoretical standpoing forcing the below select is not needed (the
dependency isn't strict nor at compile time nor at runtime) but from a
practical standpoint it is safer.

If anybody really wants THP to run without compaction, it'd be such a
weird setup that editing the Kconfig file to allow it will be surely not a
problem.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:45 -08:00
Andrea Arcangeli
13ece886d9 thp: transparent hugepage config choice
Allow to choose between the always|madvise default for page faults and
khugepaged at config time.  madvise guarantees zero risk of higher memory
footprint for applications (applications using madvise(MADV_HUGEPAGE)
won't risk to use any more memory by backing their virtual regions with
hugepages).

Initially set the default to N and don't depend on EMBEDDED.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:45 -08:00
Andrea Arcangeli
ce83d2174e thp: allocate memory in khugepaged outside of mmap_sem write mode
This tries to be more friendly to filesystem in userland, with userland
backends that allocate memory in the I/O paths and that could deadlock if
khugepaged holds the mmap_sem write mode of the userland backend while
allocating memory.  Memory allocation may wait for writeback I/O
completion from the daemon that may be blocked in the mmap_sem read mode
if a page fault happens and the daemon wasn't using mlock for the memory
required for the I/O submission and completion.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:45 -08:00
Andrea Arcangeli
0bbbc0b33d thp: add numa awareness to hugepage allocations
It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
introducing alloc_pages_vma.  khugepaged needs special handling as the
allocation has to happen inside collapse_huge_page where the vma is known
and an error has to be returned to the outer loop to sleep
alloc_sleep_millisecs in case of failure.  But it retains the more
efficient logic of handling allocation failures in khugepaged in case of
CONFIG_NUMA=n.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:45 -08:00
Andrea Arcangeli
d39d33c332 thp: enable direct defrag
With memory compaction in, and lumpy-reclaim disabled, it seems safe
enough to defrag memory during the (synchronous) transparent hugepage page
faults (TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) and not only during khugepaged
(async) hugepage allocations that was already enabled even before memory
compaction was in (TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:45 -08:00
Andrea Arcangeli
f000565adb thp: set recommended min free kbytes
If transparent hugepage is enabled initialize min_free_kbytes to an
optimal value by default.  This moves the hugeadm algorithm in kernel.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:44 -08:00
Johannes Weiner
cd7548ab36 thp: mprotect: transparent huge page support
Natively handle huge pmds when changing page tables on behalf of
mprotect().

I left out update_mmu_cache() because we do not need it on x86 anyway but
more importantly the interface works on ptes, not pmds.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:44 -08:00
Johannes Weiner
b36f5b0710 thp: mprotect: pass vma down to page table walkers
Flushing the tlb for huge pmds requires the vma's anon_vma, so pass along
the vma instead of the mm, we can always get the latter when we need it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:44 -08:00
Johannes Weiner
0ca1634d41 thp: mincore transparent hugepage support
Handle transparent huge page pmd entries natively instead of splitting
them into subpages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:44 -08:00
Johannes Weiner
f2d6bfe9ff thp: add x86 32bit support
Add support for transparent hugepages to x86 32bit.

Share the same VM_ bitflag for VM_MAPPED_COPY.  mm/nommu.c will never
support transparent hugepages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:44 -08:00
Andrea Arcangeli
5f24ce5fd3 thp: remove PG_buddy
PG_buddy can be converted to _mapcount == -2.  So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y.  This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes.  We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
21ae5b0175 thp: skip transhuge pages in ksm for now
Skip transhuge pages in ksm for now.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
b15d00b6af thp: khugepaged vma merge
register in khugepaged if the vma grows.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
ba76149f47 thp: khugepaged
Add khugepaged to relocate fragmented pages into hugepages if new
hugepages become available.  (this is indipendent of the defrag logic that
will have to make new hugepages available)

The fundamental reason why khugepaged is unavoidable, is that some memory
can be fragmented and not everything can be relocated.  So when a virtual
machine quits and releases gigabytes of hugepages, we want to use those
freely available hugepages to create huge-pmd in the other virtual
machines that may be running on fragmented memory, to maximize the CPU
efficiency at all times.  The scan is slow, it takes nearly zero cpu time,
except when it copies data (in which case it means we definitely want to
pay for that cpu time) so it seems a good tradeoff.

In addition to the hugepages being released by other process releasing
memory, we have the strong suspicion that the performance impact of
potentially defragmenting hugepages during or before each page fault could
lead to more performance inconsistency than allocating small pages at
first and having them collapsed into large pages later...  if they prove
themselfs to be long lived mappings (khugepaged scan is slow so short
lived mappings have low probability to run into khugepaged if compared to
long lived mappings).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
79134171df thp: transparent hugepage vmstat
Add hugepage stat information to /proc/vmstat and /proc/meminfo.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
b9bbfbe30a thp: memcg huge memory
Add memcg charge/uncharge to hugepage faults in huge_memory.c.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Daisuke Nishimura
152c9ccb75 thp: transhuge-memcg: commit tail pages at charge
By this patch, when a transparent hugepage is charged, not only the head
page but also all the tail pages are committed, IOW pc->mem_cgroup and
pc->flags of tail pages are set.

Without this patch:

- Tail pages are not linked to any memcg's LRU at splitting. This causes many
  problems, for example, the charged memcg's directory can never be rmdir'ed
  because it doesn't have enough pages to scan to make the usage decrease to 0.
- "rss" field in memory.stat would be incorrect. Moreover, usage_in_bytes in
  root cgroup is calculated by the stat not by res_counter(since 2.6.32),
  it would be incorrect too.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
ec1685109f thp: memcg compound
Teach memcg to charge/uncharge compound pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
500d65d471 thp: pmd_trans_huge migrate bugcheck
No pmd_trans_huge should ever materialize in migration ptes areas, because
we split the hugepage before migration ptes are instantiated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:42 -08:00
Andrea Arcangeli
0af4e98b6b thp: madvise(MADV_HUGEPAGE)
Add madvise MADV_HUGEPAGE to mark regions that are important to be
hugepage backed.  Return -EINVAL if the vma is not of an anonymous type,
or the feature isn't built into the kernel.  Never silently return
success.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:42 -08:00
Andrea Arcangeli
f66055ab6f thp: verify pmd_trans_huge isn't leaking
pte_trans_huge must not leak in certain vmas like the mmio special pfn or
filebacked mappings.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:42 -08:00
Andrea Arcangeli
05759d380a thp: split_huge_page anon_vma ordering dependency
This documents how split_huge_page is safe vs new vma inserctions into the
anon_vma that may have already released the anon_vma->lock but not
established pmds yet when split_huge_page starts.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:42 -08:00
Hugh Dickins
8a07651ee8 thp: transparent hugepage core fixlet
If you configure THP in addition to HUGETLB_PAGE on x86_32 without PAE,
the p?d-folding works out that munlock_vma_pages_range() can crash to
follow_page()'s pud_huge() BUG_ON(flags & FOLL_GET): it needs the same
VM_HUGETLB check already there on the pmd_huge() line.  Conveniently,
openSUSE provides a "blogd" which tests this out at startup!

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:42 -08:00
Andrea Arcangeli
71e3aac072 thp: transparent hugepage core
Lately I've been working to make KVM use hugepages transparently without
the usual restrictions of hugetlbfs.  Some of the restrictions I'd like to
see removed:

1) hugepages have to be swappable or the guest physical memory remains
   locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
   instead and mixed in the same vma without any failure and without
   userland noticing

3) if some task quits and more hugepages become available in the
   buddy, guest physical memory backed by regular pages should be
   relocated on hugepages automatically in regions under
   madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
   kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
   not null)

4) avoidance of reservation and maximization of use of hugepages whenever
   possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
   1 machine with 1 database with 1 database cache with 1 database cache size
   known at boot time. It's definitely not feasible with a virtualization
   hypervisor usage like RHEV-H that runs an unknown number of virtual machines
   with an unknown size of each virtual machine with an unknown amount of
   pagecache that could be potentially useful in the host for guest not using
   O_DIRECT (aka cache=off).

hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
to 19 in case only the hypervisor uses transparent hugepages, and they
decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
linux hypervisor and the linux guest both uses this patch (though the
guest will limit the addition speedup to anonymous regions only for
now...).  Even more important is that the tlb miss handler is much slower
on a NPT/EPT guest than for a regular shadow paging or no-virtualization
scenario.  So maximizing the amount of virtual memory cached by the TLB
pays off significantly more with NPT/EPT than without (even if there would
be no significant speedup in the tlb-miss runtime).

The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on
regular anonymous vmas.  This is what this patch tries to achieve in the
least intrusive possible way.  We want hugepages and hugetlb to be used in
a way so that all applications can benefit without changes (as usual we
leverage the KVM virtualization design: by improving the Linux VM at
large, KVM gets the performance boost too).

The most important design choice is: always fallback to 4k allocation if
the hugepage allocation fails!  This is the _very_ opposite of some large
pagecache patches that failed with -EIO back then if a 64k (or similar)
allocation failed...

Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split an
hugepage into 512 regular pages and it has to be done with an operation
that can't fail.  This way the reliability of the swapping isn't decreased
(no need to allocate memory when we are short on memory to swap) and it's
trivial to plug a split_huge_page* one-liner where needed without
polluting the VM.  Over time we can teach mprotect, mremap and friends to
handle pmd_trans_huge natively without calling split_huge_page*.  The fact
it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
(instead of the current void) we'd need to rollback the mprotect from the
middle of it (ideally including undoing the split_vma) which would be a
big change and in the very wrong direction (it'd likely be simpler not to
call split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle).  In short the
very value of split_huge_page is that it can't fail.

The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
incremental and it'll just be an "harmless" addition later if this initial
part is agreed upon.  It also should be noted that locking-wise replacing
regular pages with hugepages is going to be very easy if compared to what
I'm doing below in split_huge_page, as it will only happen when
page_count(page) matches page_mapcount(page) if we can take the PG_lock
and mmap_sem in write mode.  collapse_huge_page will be a "best effort"
that (unlike split_huge_page) can fail at the minimal sign of trouble and
we can try again later.  collapse_huge_page will be similar to how KSM
works and the madvise(MADV_HUGEPAGE) will work similar to
madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault
time.  This can be changed with
/sys/kernel/mm/transparent_hugepage/enabled.  The control knob can be set
to three values "always", "madvise", "never" which mean respectively that
hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
or never used.  /sys/kernel/mm/transparent_hugepage/defrag instead
controls if the hugepage allocation should defrag memory aggressively
"always", only inside "madvise" regions, or "never".

The pmd_trans_splitting/pmd_trans_huge locking is very solid.  The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head.  I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point of
view.  In short there is no current existing way to serialize the O_DIRECT
final put_page against split_huge_page_refcount so I had to invent a new
one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
returns so...).  And I didn't want to impact all gup/gup_fast users for
now, maybe if we change the gup interface substantially we can avoid this
locking, I admit I didn't think too much about it because changing the gup
unpinning interface would be invasive.

If we ignored O_DIRECT we could stick to the existing compound refcounting
code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
(and any other mmu notifier user) would call it without FOLL_GET (and if
FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
current task mmu notifier list yet).  But O_DIRECT is fundamental for
decent performance of virtualized I/O on fast storage so we can't avoid it
to solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.

Swap and oom works fine (well just like with regular pages ;).  MMU
notifier is handled transparently too, with the exception of the young bit
on the pmd, that didn't have a range check but I think KVM will be fine
because the whole point of hugepages is that EPT/NPT will also use a huge
pmd when they notice gup returns pages with PageCompound set, so they
won't care of a range and there's just the pmd young bit to check in that
case.

NOTE: in some cases if the L2 cache is small, this may slowdown and waste
memory during COWs because 4M of memory are accessed in a single fault
instead of 8k (the payoff is that after COW the program can run faster).
So we might want to switch the copy_huge_page (and clear_huge_page too) to
not temporal stores.  I also extensively researched ways to avoid this
cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
up to 1M (I can send those patches that fully implemented prefault) but I
concluded they're not worth it and they add an huge additional complexity
and they remove all tlb benefits until the full hugepage has been faulted
in, to save a little bit of memory and some cache during app startup, but
they still don't improve substantially the cache-trashing during startup
if the prefault happens in >4k chunks.  One reason is that those 4k pte
entries copied are still mapped on a perfectly cache-colored hugepage, so
the trashing is the worst one can generate in those copies (cow of 4k page
copies aren't so well colored so they trashes less, but again this results
in software running faster after the page fault).  Those prefault patches
allowed things like a pte where post-cow pages were local 4k regular anon
pages and the not-yet-cowed pte entries were pointing in the middle of
some hugepage mapped read-only.  If it doesn't payoff substantially with
todays hardware it will payoff even less in the future with larger l2
caches, and the prefault logic would blot the VM a lot.  If one is
emebdded transparent_hugepage can be disabled during boot with sysfs or
with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) that
will ensure not a single hugepage is allocated at boot time.  It is simple
enough to just disable transparent hugepage globally and let transparent
hugepages be allocated selectively by applications in the MADV_HUGEPAGE
region (both at page fault time, and if enabled with the
collapse_huge_page too through the kernel daemon).

This patch supports only hugepages mapped in the pmd, archs that have
smaller hugepages will not fit in this patch alone.  Also some archs like
power have certain tlb limits that prevents mixing different page size in
the same regions so they will not fit in this framework that requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation.  hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits.  hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
to be found not fragmented after a certain system uptime and that would be
very expensive to defragment with relocation, so requiring reservation.
hugetlbfs is the "reservation way", the point of transparent hugepages is
not to have any reservation at all and maximizing the use of cache and
hugepages at all times automatically.

Some performance result:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
ages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:42 -08:00
Andrea Arcangeli
5c3240d92e thp: don't alloc harder for gfp nomemalloc even if nowait
Not worth throwing away the precious reserved free memory pool for
allocations that can fail gracefully (either through mempool or because
they're transhuge allocations later falling back to 4k allocations).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:42 -08:00
Andrea Arcangeli
32dba98e08 thp: _GFP_NO_KSWAPD
Transparent hugepage allocations must be allowed not to invoke kswapd or
any other kind of indirect reclaim (especially when the defrag sysfs is
control disabled).  It's unacceptable to swap out anonymous pages
(potentially anonymous transparent hugepages) in order to create new
transparent hugepages.  This is true for the MADV_HUGEPAGE areas too
(swapping out a kvm virtual machine and so having it suffer an unbearable
slowdown, so another one with guest physical memory marked MADV_HUGEPAGE
can run 30% faster if it is running memory intensive workloads, makes no
sense).  If a transparent hugepage allocation fails the slowdown is minor
and there is total fallback, so kswapd should never be asked to swapout
memory to allow the high order allocation to succeed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:41 -08:00
Andrea Arcangeli
47ad8475c0 thp: clear_copy_huge_page
Move the copy/clear_huge_page functions to common code to share between
hugetlb.c and huge_memory.c.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:41 -08:00
Andrea Arcangeli
3f04f62f90 thp: split_huge_page paging
Paging logic that splits the page before it is unmapped and added to swap
to ensure backwards compatibility with the legacy swap code.  Eventually
swap should natively pageout the hugepages to increase performance and
decrease seeking and fragmentation of swap space.  swapoff can just skip
over huge pmd as they cannot be part of swap yet.  In add_to_swap be
careful to split the page only if we got a valid swap entry so we don't
split hugepages with a full swap.

In theory we could split pages before isolating them during the lru scan,
but for khugepaged to be safe, I'm relying on either mmap_sem write mode,
or PG_lock taken, so split_huge_page has to run either with mmap_sem
read/write mode or PG_lock taken.  Calling it from isolate_lru_page would
make locking more complicated, in addition to that split_huge_page would
deadlock if called by __isolate_lru_page because it has to take the lru
lock to add the tail pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:41 -08:00
Andrea Arcangeli
bae9c19bf1 thp: split_huge_page_mm/vma
split_huge_page_pmd compat code.  Each one of those would need to be
expanded to hundred of lines of complex code without a fully reliable
split_huge_page_pmd design.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:41 -08:00