linux

Author	SHA1	Message	Date
Yasunori Goto	281dd25cdc	[PATCH] swiotlb: make sure initial DMA allocations really are in DMA memory This introduces a limit parameter to the core bootmem allocator; the new parameter indicates that physical memory allocated by the bootmem allocator should be within the requested limit. We also introduce the alloc_bootmem_low_pages_limit, alloc_bootmem_node_limit and alloc_bootmem_low_pages_node_limit APIs, but alloc_bootmem_low_pages_limit is the only one used for swiotlb. The existing alloc_bootmem_low_pages() API could instead have been changed and made to pass the right limit to the core allocator. But that would make the patch more intrusive for 2.6.14, as other arches use alloc_bootmem_low_pages(). That may be done post 2.6.14 as a cleanup. With this, swiotlb gets memory within 4G for both x86_64 and ia64 arches. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Ravikiran G Thirumalai <kiran@scalex86.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-19 23:11:33 -07:00
Hugh Dickins	1c59827d1d	[PATCH] mm: hugetlb truncation fixes hugetlbfs allows truncation of its files (should it?), but hugetlb.c often forgets that: crashes and misaccounting ensue. copy_hugetlb_page_range better grab the src page_table_lock since we don't want to guess what happens if concurrently truncated. unmap_hugepage_range rss accounting must not assume the full range was mapped. follow_hugetlb_page must guard with page_table_lock and be prepared to exit early. Restyle copy_hugetlb_page_range with a for loop like the others there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-19 23:04:30 -07:00
Seth, Rohit	3359b54c8c	[PATCH] Handle spurious page fault for hugetlb region The hugetlb pages are currently pre-faulted. At the time of mmap of hugepages, we populate the new PTEs. It is possible that HW has already cached some of the unused PTEs internally. These stale entries never get a chance to be purged in existing control flow. This patch extends the check in page fault code for hugepages. Check whether a faulted address falls within the size of the hugetlb file backing it. We return VM_FAULT_MINOR for these cases (assuming that the arch specific page-faulting code purges the stale entry for the archs that need it). Signed-off-by: Rohit Seth <rohit.seth@intel.com> [ This is apparently arguably an ia64 port bug. But the code won't hurt, and for now it fixes a real problem on some ia64 machines ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-19 13:56:27 -07:00
Linus Torvalds	3d80636a0d	Fix memory ordering bug in page reclaim As noticed by Nick Piggin, we need to make sure that we check the page count before we check for PageDirty, since the dirty check is only valid if the count implies that we're the only possible ones holding the page. We always did do this, but the code needs a read-memory-barrier to make sure that the ordering is also honored by the CPU. (The writer side is ordered due to the atomic decrement and test on the page count, see the discussion on linux-kernel) Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-16 17:36:06 -07:00
Hugh Dickins	f5154a98a1	[PATCH] Don't map the same page too much Refuse to install a page into a mapping if the mapping count is already ridiculously large. You probably cannot trigger this on 32-bit architectures, but on a 64-bit setup we should protect against it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-11 12:03:47 -07:00
Suzuki	1bef400329	[PATCH] madvise: Avoid returning error code -EBADF for anonymous mappings Revert this recent correctness change: Douglas Crosher <dcrosher@scieneer.com> reported that it broke an existing application, and that madvise() works without error on anonymous mappings on Solaris. This means that madvise() will remain non-standards-compliant: we should return -EBADF for all requests against non-file-backed vma's, but Linux only does this for MADV_WILLNEED requests. Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-11 09:46:54 -07:00
Al Viro	dd0fc66fb3	[PATCH] gfp flags annotations - part 1 - added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-08 15:00:57 -07:00
Linus Torvalds	6e3254c4e2	Revert "x86-64: Reverse order of bootmem lists" As requested by Thomas Gleixner <tglx@linutronix.de>: "5d3d0f7704ed0bc7eaca0501eeae3e5da1ea6c87 breaks a couple of ARM boards, which depend on the historical bootmem allocation order. There is a cleaner solution around to remove the pgdat list completely, but this is a topic for post 2.6.14 Andi signalled ACK already." Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-30 12:38:27 -07:00
Alok N Kataria	5c38230087	[PATCH] kmalloc_node IRQ safety fix In kmalloc_node we are checking if the allocation is for the same node when interrupts are "on". This may lead to an allocation on another node than intended. This patch just shifts the check for the current node in __cache_alloc_node when interrupts are disabled. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-28 07:46:42 -07:00
Nick Piggin	8b1f312461	[PATCH] mm: move_pte to remap ZERO_PAGE Move the ZERO_PAGE remapping complexity to the move_pte macro in asm-generic, have it conditionally depend on __HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS. For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes a noop. From: Hugh Dickins <hugh@veritas.com> Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch. The "pte" at that point may be a swap entry or a pte_file entry: we must check pte_present before perhaps corrupting such an entry. Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's mm/mremap.c, and more dangerous there since it's affecting all arches: I think the safest course is to send Nick's patch and Yoichi's build fix and this fix (build tested) on to Linus - so only MIPS can be affected. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-28 07:46:40 -07:00
Andrew Morton	dbdb904500	[PATCH] revert oversized kmalloc check As davem points out, this wasn't such a great idea. There may be some code which does: size = 1024*1024; while (kmalloc(size, ...) == 0) size /= 2; which will now explode. Cc: "David S. Miller" <davem@davemloft.net> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-23 13:35:37 -07:00
Rob Landley	f7b3a4359b	[PATCH] Fix bd_claim() error code. Problem: In some circumstances, bd_claim() is returning the wrong error code. If we try to swapon an unused block device that isn't swap formatted, we get -EINVAL. But if that same block device is already mounted, we instead get -EBUSY, even though it still isn't a valid swap device. This issue came up on the busybox list trying to get the error message from "swapon -a" right. If a swap device is already enabled, we get -EBUSY, and we shouldn't report this as an error. But we can't distinguish the two -EBUSY conditions, which are very different errors. In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy means "somebody other than sys_swapon has already claimed this", and _that_ means this block device can't be a valid swap device. So return -EINVAL there. Signed-off-by: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-22 22:17:37 -07:00
Christoph Lameter	eafb42707b	[PATCH] __kmalloc: Generate BUG if size requested is too large. I had an issue on ia64 where I got a bug in kernel/workqueue because kzalloc returned a NULL pointer due to the task structure getting too big for the slab allocator. Usually these cases are caught by the kmalloc macro in include/linux/slab.h. Compilation will fail if a too big value is passed to kmalloc. However, kzalloc uses __kmalloc which has no check for that. This patch makes __kmalloc bug if a too large entity is requested. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-22 22:17:36 -07:00
Christoph Lameter	ff69416e63	[PATCH] slab: fix handling of pages from foreign NUMA nodes The numa slab allocator may allocate pages from foreign nodes onto the lists for a particular node if a node runs out of memory. Inspecting the slab->nodeid field will not reflect that the page is now in use for the slabs of another node. This patch fixes that issue by adding a node field to free_block so that the caller can indicate which node currently uses a slab. Also removes the check for the current node from kmalloc_cache_node since the process may shift later to another node which may lead to an allocation on another node than intended. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-22 22:17:35 -07:00
Ivan Kokshaysky	7243cc05ba	[PATCH] slab: alpha inlining fix It is essential that index_of() be inlined. But alpha undoes the gcc inlining hackery and index_of() ends up out-of-line. So fiddle with things to make that function inline again. Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-22 22:17:34 -07:00
Paolo 'Blaisorblade' Giarrusso	7e2cff42cf	[PATCH] mm: add a note about partially hardcoded VM_* flags Hugh made me note this line for permission checking in mprotect(): if ((newflags & ~(newflags >> 4)) & 0xf) { After figuring out what that's about, I decided it's nasty enough. Btw Hugh himself didn't like the 0xf. We can safely change it to VM_READ|VM_WRITE|VM_EXEC because we never change VM_SHARED, so no need to check that. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-21 10:11:55 -07:00
Paolo 'Blaisorblade' Giarrusso	f10df68604	[PATCH] fix locking comment in unmap_region() That comment is plain wrong (we even take the pagetable lock inside unmap_region()). Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-21 10:11:55 -07:00
Dave Hansen	f3519f9194	[PATCH] fix mm/Kconfig spelling Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-17 11:50:01 -07:00
Alok Kataria	c7e43c78ae	[PATCH] Fix slab BUG_ON() triggered by change in array cache size With the new changes that we made in the initialization of the slab allocator, we first setup the cache from which array caches are allocated, and then the cache, from which kmem_list3's are allocated. Now if the array cache comes from a cache in which objsize > 32, (in this instance size-64) then, first size-64 cache will be allocated and then the size-128 (if this is the cache from which kmem_list3's are going to be allocated). So with these new changes, we are not guaranteed that we will be initializing the malloc_sizes array in a serialized order. Thus there is a bug in __find_general_cachep, as we are checking whether the first cache_sizes ptr is NULL. This is replaced by checking whether the array-cache cache is initialized. Attached is a patch which does that. Boots fine on an x86-64, with DEBUG_SPIN, DEBUG_SLAB, and preempt. Thanks & Regards, Alok Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhitdayal.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-14 12:31:45 -07:00
Hugh Dickins	2fd4ef85e0	[PATCH] error path in setup_arg_pages() misses vm_unacct_memory() Pavel Emelianov and Kirill Korotaev observe that fs and arch users of security_vm_enough_memory tend to forget to vm_unacct_memory when a failure occurs further down (typically in setup_arg_pages variants). These are all users of insert_vm_struct, and that reservation will only be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't. So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into Committed_AS each time they're run. But don't add VM_ACCOUNT to them, it's inappropriate to reserve against the very unlikely case that gdb be used to COW a vDSO page - we ought to do something about that in do_wp_page, but there are yet other inconsistencies to be resolved. The safe and economical way to fix this is to let insert_vm_struct do the security_vm_enough_memory check when it finds VM_ACCOUNT is set. And the MIPS irix_brk has been calling security_vm_enough_memory before calling do_brk which repeats it, doubly accounting and so also leaking. Remove that, and all the fs and arch calls to security_vm_enough_memory: give it a less misleading name later on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-Off-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-14 11:18:13 -07:00
Randy Dunlap	9f1583339a	[PATCH] use add_taint() for setting tainted bit flags Use the add_taint() interface for setting tainted bit flags instead of doing it manually. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-13 08:22:29 -07:00
Andi Kleen	5b952b3c14	[PATCH] Fix MPOL_F_VERIFY There was a pretty bad bug in there that the code would always check the full VMA, not the range the user requested. When the VMA to be checked was merged with the previous VMA this could lead to spurious failures. Signed-off-by: "Andi Kleen" <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-13 08:22:28 -07:00
Con Kolivas	8d0986e289	[PATCH] vm: kswapd cleanup: use pgdat Use the pgdat pointer we've already defined in wakeup_kswapd Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-13 08:22:28 -07:00
Andi Kleen	5d3d0f7704	[PATCH] x86-64: Reverse order of bootmem lists This leads to bootmem allocating first from node 0 instead of from the last node. This avoids swiotlb allocating on the last node, which doesn't really work on a machine with >4GB. Note: there is a better patch around from someone else that gets rid of the pgdat list completely. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-12 10:49:56 -07:00
Greg Ungerer	66aa2b4b1c	[PATCH] uclinux: add NULL check, 0 end valid check and some more exports to nommu.c Move call to get_mm_counter() in update_mem_hiwater() to be inside the check for tsk->mm being null. Otherwise you can be following a null pointer here. This patch submitted by Javier Herrero <jherrero@hvsistemas.es>. Modify the end check for munmap regions to allow for the legacy behavior of 0 being valid. Pretty much all current uClinux system libc malloc's pass in 0 as the end point. A hard check will fail on these, so change the check so that if it is non-zero it must be valid otherwise it fails. A passed in value will always succeed (as it used to). Also export a few more mm system functions - to be consistent with the VM code exports. Signed-off-by: Greg Ungerer <gerg@uclinux.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-11 20:43:47 -07:00
Nishanth Aravamudan	13e4b57f6a	[PATCH] mm: fix-up schedule_timeout() usage Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:37 -07:00
Renaud Lienhart	207f36eec9	[PATCH] remove invalid comment in mm/page_alloc.c free_pages_bulk() doesn't free the entire list if count == 0. Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:31 -07:00
Victor Fusco	9de75d110c	[PATCH] mm/swap_state: Fix "nocast type" warnings Fix the sparse warning "implicit cast to nocast type" Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:28 -07:00
Victor Fusco	b2d550736f	[PATCH] mm/slab: fix sparse warnings Fix the sparse warning "implicit cast to nocast type" Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:26 -07:00
Adrian Bunk	5ce7852cdf	[PATCH] mm/filemap.c: make two functions static With Nick Piggin <npiggin@suse.de> Give some things static scope. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:25 -07:00
Ingo Molnar	8d06afab73	[PATCH] timer initialization cleanup: DEFINE_TIMER Clean up timer initialization by introducing DEFINE_TIMER a'la DEFINE_SPINLOCK. Build and boot-tested on x86. A similar patch has been in the -RT tree for some time. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 14:03:48 -07:00
Pekka Enberg	80e93effce	[PATCH] update kfree, vfree, and vunmap kerneldoc This patch clarifies NULL handling of kfree() and vfree(). In addition, the wording of the calling context restriction for vfree() and vunmap() is changed from "may not" to "must not." Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 14:03:43 -07:00
Christoph Lameter	e498be7daf	[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc_node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in a fast way like kmalloc. Test run on a 32p system with 32G RAM. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 
Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurements taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrew's temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. 
- use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 13:57:48 -07:00
Stephen Smalley	570bc1c2e5	[PATCH] tmpfs: Enable atomic inode security labeling This patch modifies tmpfs to call the inode_init_security LSM hook to set up the incore inode security state for new inodes before the inode becomes accessible via the dcache. As there is no underlying storage of security xattrs in this case, it is not necessary for the hook to return the (name, value, len) triple to the tmpfs code, so this patch also modifies the SELinux hook function to correctly handle the case where the (name, value, len) pointers are NULL. The hook call is needed in tmpfs in order to support proper security labeling of tmpfs inodes (e.g. for udev with tmpfs /dev in Fedora). With this change in place, we should then be able to remove the security_inode_post_create/mkdir/... hooks safely. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 13:57:28 -07:00
Mark Fasheh	fef266580e	[PATCH] update filesystems for new delete_inode behavior Update the file systems in fs/ implementing a delete_inode() callback to call truncate_inode_pages(). One implementation note: In developing this patch I put the calls to truncate_inode_pages() at the very top of those filesystems delete_inode() callbacks in order to retain the previous behavior. I'm guessing that some of those could probably be optimized. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 13:57:27 -07:00
Andi Kleen	d42c69972b	[PATCH] PCI: Run PCI driver initialization on local node Run PCI driver initialization on local node Instead of adding messy kmalloc_node()s everywhere run the PCI driver probe on the node local to the device. This would not have helped for IDE, but should for other more clean drivers that do more initialization in probe(). It won't help for drivers that do most of the work on first open (like many network drivers) Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2005-09-08 14:57:23 -07:00
Pekka J Enberg	dd3927105b	[PATCH] introduce and use kzalloc This patch introduces a kzalloc wrapper and converts kernel/ to use it. It saves a little program text. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:45 -07:00
Paul Jackson	ef08e3b498	[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:40 -07:00
Paul Jackson	9bf2229f88	[PATCH] cpusets: formalize intermediate GFP_KERNEL containment This patch makes use of the previously underutilized cpuset flag 'mem_exclusive' to provide what amounts to another layer of memory placement resolution. With this patch, there are now the following four layers of memory placement available: 1) The whole system (interrupt and GFP_ATOMIC allocations can use this), 2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use), 3) The current tasks cpuset (GFP_USER allocations constrained to here), and 4) Specific node placement, using mbind and set_mempolicy. These nest - each layer is a subset (same or within) of the previous. Layer (2) above is new, with this patch. The call used to check whether a zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is extended to take a gfp_mask argument, and its logic is extended, in the case that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if placement is allowed. The definition of GFP_USER, which used to be identical to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous cpuset_gfp_hardwall_flag patch. GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks cpuset, so long as any node therein is not too tight on memory, but will escape to the larger layer, if need be. The intended use is to allow something like a batch manager to handle several jobs, each job in its own cpuset, but using common kernel memory for caches and such. Swapper and oom_kill activity is also constrained to Layer (2). A task in or below one mem_exclusive cpuset should not cause swapping on nodes in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a task in another such cpuset. 
Heavy use of kernel memory for i/o caching and such by one job should not impact the memory available to jobs in other non-overlapping mem_exclusive cpusets. This patch enables providing hardwall, inescapable cpusets for memory allocations of each job, while sharing kernel memory allocations between several jobs, in an enclosing mem_exclusive cpuset. Like Dinakar's patch earlier to enable administering sched domains using the cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag that had previously done nothing much useful other than restrict what cpuset configurations were allowed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:40 -07:00
Paul Jackson	a49335ccea	[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first in line to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research project's mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. 
Let memory allocations for user space (GFP_USER) be constrained by a task's current cpuset, but memory allocations for kernel space (GFP_KERNEL) be constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current task's cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings' memory nodes. This patch was presented on linux-mm in early July 2005, though it did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current task's cpuset. 3) Now memory requests (such as KERNEL) that are not marked HARDWALL can, if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current task's cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no effect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. 
This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple of typos and spellos that bugged me, while I was here. This patch should have no material effect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:39 -07:00
Ravikiran G Thirumalai	6c231b7bab	[PATCH] Additions to .data.read_mostly section Mark variables which are usually accessed for reads with __read_mostly. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:33 -07:00
Steven Pratt	3b30bbd963	[PATCH] readahead: reset cache_hit earlier We don't reset the cache hit count until after readahead does a successful readahead. This seems to leave a corner case open where we miss in cache, but don't restart the readahead right away. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:25 -07:00
Christoph Hellwig	cdb3826b99	[PATCH] remove misleading comment above sys_brk Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:23 -07:00
Christoph Lameter	c3d8c14145	[PATCH] More __read_mostly variables Move some more frequently read variables that showed up during some of our performance tests as sometimes ending up in hot cachelines to the read_mostly section. Fix: Move the __read_mostly from before hpet_usec_quotient to follow the variable like the other uses of __read_mostly. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Christoph Lameter <christoph@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:18 -07:00
Stephen Smalley	f549d6c18c	[PATCH] Generic VFS fallback for security xattrs This patch modifies the VFS setxattr, getxattr, and listxattr code to fall back to the security module for security xattrs if the filesystem does not support xattrs natively. This allows security modules to export the incore inode security label information to userspace even if the filesystem does not provide xattr storage, and eliminates the need to individually patch various pseudo filesystem types to provide such access. The patch removes the existing xattr code from devpts and tmpfs as it is then no longer needed. The patch restructures the code flow slightly to reduce duplication between the normal path and the fallback path, but this should only have one user-visible side effect - a program may get -EACCES rather than -EOPNOTSUPP if policy denied access but the filesystem didn't support the operation anyway. Note that the post_setxattr hook call is not needed in the fallback case, as the inode_setsecurity hook call handles the incore inode security state update directly. In contrast, we do call fsnotify in both cases. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:52 -07:00
Martin Hicks	c07e02db76	[PATCH] VM: add page_state info to per-node meminfo Add page_state info to the per-node meminfo file in sysfs. This is mostly just for informational purposes. The lack of this information was brought up recently during a discussion regarding pagecache clearing, and I put this patch together to test out one of the suggestions. It seems like interesting info to have, so I'm submitting the patch. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:49 -07:00
Manfred Spraul	00e145b6d5	[PATCH] slab: removes local_irq_save()/local_irq_restore() pair Proposed by and based on a patch from Eric Dumazet <dada1@cosmosbay.com>: This patch removes an unnecessary critical section in the ksize() function, as cli/sti are rather expensive on modern CPUs. It additionally adds a docbook entry for ksize() and further simplifies the code. Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:49 -07:00
Eric Dumazet	34342e863c	[PATCH] mm/slab.c: prefetchw the start of new allocated objects Most objects returned by __cache_alloc() will be written by the caller (though not all callers write the whole object; some write only at the beginning). prefetchw() tells a modern CPU to anticipate the future writes, i.e. to start some memory transactions in advance. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:48 -07:00
Zachary Amsden	a600388d28	[PATCH] x86: ptep_clear optimization Add a new accessor for PTEs, which passes the full hint from the mmu_gather struct; this allows architectures with hardware pagetables to optimize away atomic PTE operations when destroying an address space. Removing the locked operation should allow better pipelining of memory access in this loop. I measured an average savings of 30-35 cycles per zap_pte_range on the first 500 destructions on Pentium-M, but I believe the optimization would win more on older processors which still assert the bus lock on xchg for an exclusive cacheline. Update: I made some new measurements, and this saves exactly 26 cycles over ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180 cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a full address space destruction. pte_clear_full is not yet used, but is provided for future optimizations (in particular, when running inside of a hypervisor that queues page table updates, the full hint allows us to avoid queueing unnecessary page table updates for an address space in the process of being destroyed). This is not a huge win, but it does help a bit, and sets the stage for further hypervisor optimization of the mm layer on all architectures. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Christoph Lameter <christoph@lameter.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:48 -07:00
Kyle Moffett	fa5b08d5f8	[PATCH] slab: consolidate kmem_bufctl_t This is used only in slab.c and each architecture gets to define which underlying type is to be used. Seems a bit silly - move it to slab.c and use the same type for all architectures: unsigned int. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:48 -07:00
Adam Litke	7bf07f3d4b	[PATCH] hugetlb: move stale pte check into huge_pte_alloc() Initial Post (Wed, 17 Aug 2005) This patch moves the if (! pte_none(*pte)) hugetlb_clean_stale_pgtable(pte); logic into huge_pte_alloc() so all of its callers can be immune to the bug described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246 > It turns out there is a bug in hugetlb_prefault(): with 3 level page table, > huge_pte_alloc() might return a pmd that points to a PTE page. It happens > if the virtual address for hugetlb mmap is recycled from previously used > normal page mmap. free_pgtables() might not scrub the pmd entry on > munmap and hugetlb_prefault skips on any pmd presence regardless what type > it is. Unless I am missing something, it seems more correct to place the check inside huge_pte_alloc() to prevent the same bug wherever a huge pte is allocated. It also allows checking for this condition when lazily faulting huge pages later in the series. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:46 -07:00
Deepak Saxena	fd195c49fb	[PATCH] arm: allow for arch-specific IOREMAP_MAX_ORDER Version 6 of the ARM architecture introduces the concept of 16MB pages (supersections) and 36-bit (40-bit actually, but nobody uses this) physical addresses. 36-bit addressed memory and I/O on ARMv6 can only be mapped using supersections, and the requirement on these is that both virtual and physical addresses be 16MB aligned. In trying to add support for ioremap() of 36-bit I/O, we run into the issue that get_vm_area() allows for a maximum of 512K alignment via the IOREMAP_MAX_ORDER constant. To work around this, we can: - Allocate a larger VM area than needed (size + (1ul << IOREMAP_MAX_ORDER)) and then align the pointer ourselves, but this ends up with 512K of wasted VM per ioremap(). - Provide a new __get_vm_area_aligned() API and make __get_vm_area() sit on top of this. I did this and it works but I don't like the idea of adding another VM API just for this one case. - My preferred solution, which is to allow the architecture to override the IOREMAP_MAX_ORDER constant with its own version. Signed-off-by: Deepak Saxena <dsaxena@plexity.net> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:46 -07:00
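
The override approach preferred in the message above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel source: `PAGE_SHIFT` is fixed at 12 (4K pages), the default value is chosen to reproduce the 512K figure quoted in the message, and `highest_bit()` and `ioremap_alignment()` are hypothetical helper names.

```c
/* Assumption: 4K pages. The real PAGE_SHIFT is architecture-specific. */
#define PAGE_SHIFT 12

/*
 * An architecture header included before this point could define its own
 * value, e.g. 24 for 16MB supersections (illustrative only).
 */
#ifndef IOREMAP_MAX_ORDER
#define IOREMAP_MAX_ORDER (7 + PAGE_SHIFT)	/* 128 pages = 512K with 4K pages */
#endif

/* Hypothetical helper: 1-based index of the highest set bit, 0 for x == 0. */
static int highest_bit(unsigned long x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Hypothetical helper: alignment chosen for an ioremap area of `size` bytes. */
static unsigned long ioremap_alignment(unsigned long size)
{
	int bit = highest_bit(size);

	/* Clamp the order between one page and the (overridable) maximum. */
	if (bit > IOREMAP_MAX_ORDER)
		bit = IOREMAP_MAX_ORDER;
	else if (bit < PAGE_SHIFT)
		bit = PAGE_SHIFT;
	return 1UL << bit;
}
```

With the default shown here, `ioremap_alignment(1UL << 24)` is capped at 512K. If `IOREMAP_MAX_ORDER` were defined as 24 before the `#ifndef`, the same call would yield the 16MB alignment that supersections require, without adding any new API.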
Paolo 'Blaisorblade' Giarrusso	4944e76d81	[PATCH] mm: remove implied vm_ops check If !vma->vm_ops we already BUG above, so retesting it is useless. The compiler cannot optimize this because BUG is a macro and thus is not marked noreturn; that should possibly be fixed. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:45 -07:00
Paolo 'Blaisorblade' Giarrusso	d44ed4f868	[PATCH] shmem_populate: avoid a useless check, and some comments Either shmem_getpage returns a failure, or it found a page, or it was told it couldn't do any I/O. So it's useless to check nonblock in the else branch. We could add a BUG() there but I preferred to comment the offending function. This was taken out from one of Ingo Molnar's old patches I'm resurrecting. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Ingo Molnar <mingo@elte.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:45 -07:00
Martin Hicks	0abf40c1ac	[PATCH] vm: slab.c spelling correction Fix a small spelling mistake. subtile->subtle Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:45 -07:00
Hugh Dickins	836d5ffd34	[PATCH] mm: fix madvise vma merging Better late than never, I've at last reviewed the madvise vma merging going into 2.6.13. Remove a pointless check and fix two little bugs - a simple test (with /proc/<pid>/maps hacked to show ReadHints) showed both mismerges in practice: though being madvise, neither was disastrous. 1. Correct placement of the success label in madvise_behavior: as in mprotect_fixup and mlock_fixup, it is necessary to update vm_flags when vma_merge succeeds (to handle the exceptional Case 8 noted in the comments above vma_merge itself). 2. Correct initial value of prev when starting part way into a vma: as in sys_mprotect and do_mlock, it needs to be set to vma in this case (vma_merge handles only that minimum of cases shown in its comments). 3. If find_vma_prev sets prev, then the vma it returns is prev->vm_next, so it's pointless to make that same assignment again in sys_madvise. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Martin Hicks	53e9a6159f	[PATCH] VM: zone reclaim atomic ops cleanup Christoph Lameter and Marcelo Tosatti asked to get rid of the atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Martin Hicks	bce5f6ba34	[PATCH] VM: add capabilities check to set_zone_reclaim Add a capability check to sys_set_zone_reclaim(). This syscall is not something that should be available to a user. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Nick Piggin	242e546862	[PATCH] mm: remove atomic This bitop does not need to be atomic because it is performed when there will be no references to the page (ie. the page is being freed). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Nick Piggin	9a61c349b2	[PATCH] mm: remap ZERO_PAGE mappings filemap_xip's nopage routine maps the ZERO_PAGE into readonly mappings, if it has no data page to map there: then if the hole in the file is later filled, __xip_unmap uses an rmap technique to replace the ZERO_PAGEs mapped for that offset by the newly allocated file page, so that established mappings will see the newly written data. However, on MIPS (alone) there's not one but as many as eight ZERO_PAGEs, chosen for coloring by user virtual address; and if mremap has meanwhile been used to move a mapping containing a ZERO_PAGE, it will generally not match the ZERO_PAGE(address) __xip_unmap is looking for. To maintain XIP's established mappings correctly on MIPS, we need Nick's fix to mremap's move_one_page (originally presented as an optimization), to replace the ZERO_PAGE appropriate to the old address by the ZERO_PAGE appropriate to the new address. (But when I first saw this, I was thinking the ZERO_PAGEs themselves would get corrupted, very bad. Now I think it's the other way round, that the established mappings will fail to see the newly written data: incorrect, but not corrupting everything else. Whether filemap_xip's technique is generally safe, I'd hesitate to say in a hurry: it's interesting, but we've never tried to do that in tmpfs.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Nick Piggin	4d7670e0f6	[PATCH] mm: cleanup rmap Thanks to Bill Irwin for pointing this out. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:43 -07:00
Nick Piggin	2822c1aa57	[PATCH] mm: micro-optimise rmap Microoptimise page_add_anon_rmap. Although these expressions are used only in the taken branch of the if() statement, the compiler can't reorder them inside because atomic_inc_and_test is a barrier. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:43 -07:00
Nick Piggin	c3dce2d89c	[PATCH] mm: comment rmap Just be clear that VM_RESERVED pages here are a bug, and the test is not there because they are expected. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:43 -07:00
Christoph Lameter	6e21c8f145	[PATCH] /proc/<pid>/numa_maps to show on which nodes pages reside This patch was recently discussed on linux-mm: http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2 I inherited a large code base from Ray for page migration. There was a small patch in there that I find to be very useful since it allows the display of the locality of the pages in use by a process. I reworked that patch and came up with a /proc/<pid>/numa_maps that gives more information about the vma's of a process. numa_maps is indexes by the start address found in /proc/<pid>/maps. F.e. with this patch you can see the page use of the "getty" process: margin:/proc/12008 # cat maps 00000000-00004000 r--p 00000000 00:00 0 2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so 2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so 2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0 2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0 2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE 2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache 2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0 4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty 6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty 6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap] 60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0 60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0 [stack] a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso] cat numa_maps 2000000000000000 default MaxRef=43 
Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2 2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2 2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16 2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2 2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3 2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3 2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2 2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1 4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2 6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 getty uses ld.so. The first vma is the code segment which is used by 43 other processes and the pages are evenly distributed over the 4 nodes. The second vma is the process specific data portion for ld.so. This is only one page. The display format is: <startaddress> Links to information in /proc/<pid>/map <memory policy> This can be "default" "interleave={}", "prefer=<node>" or "bind={<zones>}" MaxRef= <maximum reference to a page in this vma> Pages= <Nr of pages in use> Mapped= <Nr of pages with mapcount > Anon= <nr of anonymous pages> Nx= <Nr of pages on Node x> The content of the proc-file is self-evident. If this would be tied into the sparsemem system then the contents of this file would not be too useful. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:43 -07:00
Hugh Dickins	839b9685e8	[PATCH] rmap: don't test rss Remove the three get_mm_counter(mm, rss) tests from rmap.c: there was a time when testing rss was important to avoid a particular race between dup_mmap and the anonmm rmap; but now it's just a rather silly pseudo- optimization, made even more obscure by the get_mm_counter macro. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:42 -07:00
Hugh Dickins	3279ffd97f	[PATCH] delete from_swap_cache BUG_ONs Three of the four BUG_ONs in delete_from_swap_cache are immediately repeated in __delete_from_swap_cache: delete those and add the one. But perhaps mm/ is altogether overprovisioned with historic BUGs? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:42 -07:00
Hugh Dickins	5d337b9194	[PATCH] swap: swap_lock replace list+device The idea of a swap_device_lock per device, and a swap_list_lock over them all, is appealing; but in practice almost every holder of swap_device_lock must already hold swap_list_lock, which defeats the purpose of the split. The only exceptions have been swap_duplicate, valid_swaphandles and an untrodden path in try_to_unuse (plus a few places added in this series). valid_swaphandles doesn't show up high in profiles, but swap_duplicate does demand attention. However, with the hold time in get_swap_pages so much reduced, I've not yet found a load and set of swap device priorities to show even swap_duplicate benefitting from the split. Certainly the split is mere overhead in the common case of a single swap device. So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock (generally we seem to prefer an _ in the name, and not hide in a macro). If someone can show a regression in swap_duplicate, then probably we should add a hashlock for the swap_map entries alone (shorts being anatomic), so as to help the case of the single swap device too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:42 -07:00
Hugh Dickins	048c27fd72	[PATCH] swap: scan_swap_map latency breaks The get_swap_page/scan_swap_map latency can be so bad that even those without preemption configured deserve relief: periodically cond_resched. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	52b7efdbe5	[PATCH] swap: scan_swap_map drop swap_device_lock get_swap_page has often shown up on latency traces, doing lengthy scans while holding two spinlocks. swap_list_lock is already dropped, now scan_swap_map drop swap_device_lock before scanning the swap_map. While scanning for an empty cluster, don't worry that racing tasks may allocate what was free and free what was allocated; but when allocating an entry, check it's still free after retaking the lock. Avoid dropping the lock in the expected common path. No barriers beyond the locks, just let the cookie crumble; highest_bit limit is volatile, but benign. Guard against swapoff: must check SWP_WRITEOK before allocating, must raise SWP_SCANNING reference count while in scan_swap_map, swapoff wait for that to fall - just use schedule_timeout, we don't want to burden scan_swap_map itself, and it's very unlikely that anyone can really still be in scan_swap_map once swapoff gets this far. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	7dfad4183b	[PATCH] swap: scan_swap_map restyled Rewrite scan_swap_map to allocate in just the same way as before (taking the next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly empty cluster, falling back to lowest entry if none), but with a view towards dropping the lock in the next patch. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	fb4f88dcab	[PATCH] swap: get_swap_page drop swap_list_lock Rewrite get_swap_page to allocate in just the same sequence as before, but without holding swap_list_lock across its scan_swap_map. Decrement nr_swap_pages and update swap_list.next in advance, while still holding swap_list_lock. Skip full devices by testing highest_bit. Swapoff hold swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK. Reduces lock contention when there are parallel swap devices of the same priority. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	89d09a2c80	[PATCH] swap: freeing update swap_list.next This makes negligible difference in practice: but swap_list.next should not be updated to a higher prio in the general helper swap_info_get, but rather in swap_entry_free; and then only in the case when entry is actually freed. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	6eb396dc4a	[PATCH] swap: swap unsigned int consistency The swap header's unsigned int last_page determines the range of swap pages, but swap_info has been using int or unsigned long in some cases: use unsigned int throughout (except, in several places a local unsigned long is useful to avoid overflows when adding). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	53092a7402	[PATCH] swap: show span of swap extents The "Adding %dk swap" message shows the number of swap extents, as a guide to how fragmented the swapfile may be. But a useful further guide is what total extent they span across (sometimes scarily large). And there's no need to keep nr_extents in swap_info: it's unused after the initial message, so save a little space by keeping it on stack. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00
Hugh Dickins	11d31886db	[PATCH] swap: swap extent list is ordered There are several comments that swap's extent_list.prev points to the lowest extent: that's not so, it's extent_list.next which points to it, as you'd expect. And a couple of loops in add_swap_extent which go all the way through the list, when they should just add to the other end. Fix those up, and let map_swap_page search the list forwards: profiles show it to be twice as quick that way - because prefetch works better on how the structs are typically kmalloc'ed? or because usually more is written to than read from swap, and swap is allocated ascendingly? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00
Hugh Dickins	4cd3bb10ff	[PATCH] swap: move destroy_swap_extents calls sys_swapon's call to destroy_swap_extents on failure is made after the final swap_list_unlock, which is faintly unsafe: another sys_swapon might already be setting up that swap_info_struct. Calling it earlier, before taking swap_list_lock, is safe. sys_swapoff's call to destroy_swap_extents was safe, but likewise move it earlier, before taking the locks (once try_to_unuse has completed, nothing can be needing the swap extents). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00
Hugh Dickins	e2244ec2ef	[PATCH] swap: correct swapfile nr_good_pages If a regular swapfile lies on a filesystem whose blocksize is less than PAGE_SIZE, then setup_swap_extents may have to cut the number of usable swap pages; but sys_swapon's nr_good_pages was not expecting that. Also, setup_swap_extents takes no account of badpages listed in the swap header: not worth doing so, but ensure nr_badpages is 0 for a regular swapfile. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00
Hugh Dickins	b0d9bcd4bb	[PATCH] swap: update swapfile i_sem comment Update swap extents comment: nowadays we guard with S_SWAPFILE not i_sem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00
Dave Hansen	28ae55c98e	[PATCH] sparsemem extreme: hotplug preparation This splits up sparse_index_alloc() into two pieces. This is needed because we'll allocate the memory for the second level in a different place from where we actually consume it to keep the allocation from happening underneath a lock Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:38 -07:00
Bob Picco	3e347261a8	[PATCH] sparsemem extreme implementation With cleanups from Dave Hansen <haveblue@us.ibm.com> SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to mem_sections. This two level layout scheme is able to achieve smaller memory requirements for SPARSEMEM with the tradeoff of an additional shift and load when fetching the memory section. The current SPARSEMEM implementation is a one dimensional array of mem_sections which is the default SPARSEMEM configuration. The patch attempts to isolate the implementation details of the physical layout of the sparsemem section array. SPARSEMEM_EXTREME requires bootmem to be functioning at the time of memory_present() calls. This is not always feasible, so architectures which do not need it may allocate everything statically by using SPARSEMEM_STATIC. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:38 -07:00
Bob Picco	802f192e4a	[PATCH] SPARSEMEM EXTREME A new option for SPARSEMEM is ARCH_SPARSEMEM_EXTREME. Architecture platforms with a very sparse physical address space would likely want to select this option. For those architecture platforms that don't select the option, the code generated is equivalent to SPARSEMEM currently in -mm. I'll be posting a patch on ia64 ml which uses this new SPARSEMEM feature. ARCH_SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to mem_sections. This two level layout scheme is able to achieve smaller memory requirements for SPARSEMEM with the tradeoff of an additional shift and load when fetching the memory section. The current SPARSEMEM -mm implementation is a one dimensional array of mem_sections which is the default SPARSEMEM configuration. The patch attempts to isolate the implementation details of the physical layout of the sparsemem section array. ARCH_SPARSEMEM_EXTREME depends on 64BIT and is by default boolean false. I've boot tested under aim load ia64 configured for ARCH_SPARSEMEM_EXTREME. I've also boot tested a 4 way Opteron machine with !ARCH_SPARSEMEM_EXTREME and tested with aim. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:38 -07:00
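
The two-level layout described in the SPARSEMEM_EXTREME messages above (an array of pointers to arrays of mem_sections, costing one extra shift and load per lookup) can be sketched as follows. This is a user-space illustration, not the kernel implementation: the sizes, the `section_mem_map` field, and the names `section_roots`, `section_alloc()`, and `section_lookup()` are assumptions, and `calloc()` stands in for the bootmem allocation the messages mention.

```c
#include <stdlib.h>

/* Hypothetical per-section descriptor. */
struct mem_section {
	unsigned long section_mem_map;
};

/* Hypothetical sizes: 64 roots of 256 sections each. */
#define SECTIONS_PER_ROOT_SHIFT	8
#define SECTIONS_PER_ROOT	(1UL << SECTIONS_PER_ROOT_SHIFT)
#define SECTION_ROOT_MASK	(SECTIONS_PER_ROOT - 1)
#define NR_SECTION_ROOTS	64UL

/* First level: one pointer per root; unpopulated roots stay NULL. */
static struct mem_section *section_roots[NR_SECTION_ROOTS];

/* Look up section `nr`; costs one shift and one extra load over a flat array. */
static struct mem_section *section_lookup(unsigned long nr)
{
	unsigned long root = nr >> SECTIONS_PER_ROOT_SHIFT;

	if (root >= NR_SECTION_ROOTS || !section_roots[root])
		return NULL;
	return &section_roots[root][nr & SECTION_ROOT_MASK];
}

/* Populate the second level on demand and return the entry for `nr`. */
static struct mem_section *section_alloc(unsigned long nr)
{
	unsigned long root = nr >> SECTIONS_PER_ROOT_SHIFT;

	if (root >= NR_SECTION_ROOTS)
		return NULL;
	if (!section_roots[root]) {
		section_roots[root] = calloc(SECTIONS_PER_ROOT,
					     sizeof(struct mem_section));
		if (!section_roots[root])
			return NULL;
	}
	return &section_roots[root][nr & SECTION_ROOT_MASK];
}
```

Only roots that actually contain present sections consume second-level memory, which is where the saving for a very sparse physical address space would come from in this sketch.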
Nick Piggin	d992895ba2	[PATCH] Lazy page table copies in fork() Defer copying of ptes until fault time when it is possible to reconstruct the pte from backing store. Idea from Andi Kleen and Nick Piggin. Thanks to input from Rik van Riel and Linus and to Hugh for correcting my blundering. Ray Fucillo <fucillo@intersystems.com> reports: "I applied this latest patch to a 2.6.12 kernel and found that it does resolve the problem. Prior to the patch on this machine, I was seeing about 23ms spent in fork for ever 100MB of shared memory segment. After applying the patch, fork is taking about 1ms regardless of the shared memory size." Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-29 17:25:04 -07:00
Linus Torvalds	cc314eef01	Fix nasty ncpfs symlink handling bug. This bug could cause oopses and page state corruption, because ncpfs used the generic page-cache symlink handling functions. But those functions only work if the page cache is guaranteed to be "stable", i.e. a page that was installed when the symlink walk was started has to still be installed in the page cache at the end of the walk. We could have fixed ncpfs to not use the generic helper routines, but it is in many ways much cleaner to instead improve on the symlink walking helper routines so that they don't require that absolute stability. We do this by allowing "follow_link()" to return an error-pointer as a cookie, which is fed back to the cleanup "put_link()" routine. This also simplifies NFS symlink handling. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-19 18:02:56 -07:00
David Gibson	c7546f8f03	[PATCH] Fix hugepage crash on failing mmap() This patch fixes a crash in the hugepage code. unmap_hugepage_area() was assuming that (due to prefault) PTEs must exist for all the area in question. However, this may not be the case, if mmap() encounters an error before the prefault and calls unmap_region() to clean up any partial mapping. Depending on the hugepage configuration, this crash can be triggered by an unprivileged user. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-05 12:22:37 -07:00
Simon Derr	2f60f8d357	[PATCH] __vm_enough_memory() signedness fix We have found what seems to be a small bug in __vm_enough_memory() when sysctl_overcommit_memory is set to OVERCOMMIT_NEVER. When this bug occurs the system fails to boot, with /sbin/init whining about fork() returning ENOMEM. We hunted down the problem to this: The deferred update mechanism used in vm_acct_memory(), on an SMP system, allows the vm_committed_space counter to have a negative value. This should not be a problem since this counter is known to be inaccurate. But in __vm_enough_memory() this counter is compared to the `allowed' variable, which is an unsigned long. This comparison is broken since it will consider the negative values of vm_committed_space to be huge positive values, resulting in a memory allocation failure. Signed-off-by: <Jean-Marc.Saffroy@ext.bull.net> Signed-off-by: <Simon.Derr@bull.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-04 21:43:14 -07:00
Hugh Dickins	1c5ad84516	[PATCH] fix VmSize and VmData after mremap mremap's move_vma is applying __vm_stat_account to the old vma which may have already been freed: move it to just before the do_munmap. mremapping to and fro with CONFIG_DEBUG_SLAB=y showed /proc/<pid>/status VmSize and VmData wrapping just like in kernel bugzilla #4842, and fixed by this patch - worth including in 2.6.13, though not yet confirmed that it fixes that specific report from Frank van Maarseveen. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-04 13:11:15 -07:00
Linus Torvalds	a68d2ebc15	Fix up recent get_user_pages() handling The VM_FAULT_WRITE thing is an extra bit, not a valid return value, and has to be treated as such by get_user_pages(). Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-03 10:07:09 -07:00
Nick Piggin	f33ea7f404	[PATCH] fix get_user_pages bug Checking pte_dirty instead of pte_write in __follow_page is problematic for s390, and for copy_one_pte which leaves dirty when clearing write. So revert __follow_page to check pte_write as before, and make do_wp_page pass back a special extra VM_FAULT_WRITE bit to say it has done its full job: once get_user_pages receives this value, it no longer requires pte_write in __follow_page. But most callers of handle_mm_fault, in the various architectures, have switch statements which do not expect this new case. To avoid changing them all in a hurry, make an inline wrapper function (using the old name) that masks off the new bit, and use the extended interface with double underscores. Yes, we do have a call to do_wp_page from do_swap_page, but no need to change that: in rare case it's needed, another do_wp_page will follow. Signed-off-by: Hugh Dickins <hugh@veritas.com> [ Cleanups by Nick Piggin ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-03 09:12:05 -07:00
Eric Dumazet	ba17101b41	[PATCH] sys_set_mempolicy() doesnt check if mode < 0 A kernel BUG() is triggered by a call to set_mempolicy() with a negative first argument. This is because the mode is declared as an int, and the validity check doesnt check < 0 values. Alternatively, mode could be declared as unsigned int or unsigned long. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-01 21:38:00 -07:00
Hugh Dickins	690dbe1ced	[PATCH] x86_64: access of some bad address x86_64 has a large sparse gate area between VSYSCALL_START and VSYSCALL_END, not all of it presently backed by pmds. Alexander Nyberg has found that in some circumstances gdb may try to ptrace here, and hit get_user_pages BUG_ON. It seems odd that gdb should be accessing here, but it certainly shouldn't crash in this way: relax BUG_ON to -EFAULT. Fixes kernel bugzilla #4801. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-01 21:38:00 -07:00
Linus Torvalds	4ceb5db975	Fix get_user_pages() race for write access There's no real guarantee that handle_mm_fault() will always be able to break a COW situation - if an update from another thread ends up modifying the page table some way, handle_mm_fault() may end up requiring us to re-try the operation. That's normally fine, but get_user_pages() ended up re-trying it as a read, and thus a write access could in theory end up losing the dirty bit or be done on a page that had not been properly COW'ed. This makes get_user_pages() always retry write accesses as write accesses by making "follow_page()" require that a writable follow has the dirty bit set. That simplifies the code and solves the race: if the COW break fails for some reason, we'll just loop around and try again. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-08-01 11:14:49 -07:00
Martin J. Bligh	e310fd4325	[PATCH] Fix NUMA node sizing in nr_free_zone_pages We are iterating over all nodes in nr_free_zone_pages(). Because the fallback zonelists contain all nodes in the system, and we walk all the zonelists, we're counting memory multiple times (once for each node). This caused us to make a size estimate of 32GB for an 8GB AMD64 box, which makes all the dirty ratio calculations, etc incorrect. There's still a further bug to fix from e820 holes causing overestimation as well, but this fix is separate, and good as is, and fixes one class of problems. Problem found by Badari, and tested by Ram Pai - thanks! Signed-off-by: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: Matt Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-30 10:14:46 -07:00
Andy Whitcroft	12b1c5f382	[PATCH] Remove bogus warning in page_alloc.c Originally __free_pages_bulk used the relative page number within a zone to define its buddies. This meant that to maintain the "maximally aligned" requirements (that an allocation of size N will be aligned at least to N physically) zones had to also be aligned to 1<<MAX_ORDER pages. When __free_pages_bulk was updated to use the relative page frame numbers of the free'd pages to pair buddies this released the alignment constraint on the 'left' edge of the zone. This allows _either_ edge of the zone to contain partial MAX_ORDER sized buddies. These simply never will have matching buddies and thus will never make it to the 'top' of the pyramid. The patch below removes a now redundant check ensuring that the mem_map was aligned to MAX_ORDER. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-27 16:25:54 -07:00
suzuki	165cd40235	[PATCH] madvise() does not always return -EBADF on non-file mapped area The madvise() system call returns -EBADF for areas which do not map to files, only for behaviour request MADV_WILLNEED. According to man pages, madvise returns : EBADF - the map exists, but the area maps something that isn't a file. Fixes bug 2995. Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-27 16:25:54 -07:00
Andrew Morton	1aaf18ff9d	[PATCH] check_user_page_readable() deadlock fix Fix bug identified by Richard Purdie <rpurdie@rpsys.net>. oprofile calls check_user_page_readable() from interrupt context, so we deadlock over various VFS locks. But check_user_page_readable() doesn't imply either a read or a write of the page's contents. Change __follow_page() so that check_user_page_readable() can tell __follow_page() that we're not accessing the page's contents, and use that info to avoid the troublesome lock-takings. Also, make follow_page() inline for the single callsite in memory.c to save a bit of stack space. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-27 16:25:53 -07:00
Andi Kleen	90c5029e47	[PATCH] Undo mempolicy shared policy rbtree microoptimization All mempolicy changes must be inside the spinlock and readding the rb_erase prevents a crash while doing: > echo "1" > /tmp/numatest > numactl --length=0x4000 --shm /tmp/numatest --localalloc > numactl --length=0x2000 --offset=0 --shm /tmp/numatest --membind=0 > numactl --length=0x2000 --offset=0x2000 --shm /tmp/numatest --membind=1 > ipcs > ipcrm -M "the_key_value_of_this_shm_area" Based on a patch by John Blackwood Cc: <john.blackwood@ccur.com> Cc: <andrea@suse.de> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-27 16:25:52 -07:00
Carsten Otte	afa597ba20	[PATCH] execute-in-place fixes This patch includes feedback from Andrew and Christoph. Thanks for taking time to review. Use of empty_zero_page was eliminated to fix compilation for architectures that don't have it. This patch removes setting pages up-to-date in ext2_get_xip_page and all bug checks to verify that the page is indeed up to date. Setting the page state on mapping to userland is bogus. None of the code paths involved with these pages in mm cares about the page state. Still on my ToDo list: identify a place outside second extended where __inode_direct_access should reside. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-15 09:54:50 -07:00
Geert Uytterhoeven	082ff0a999	[PATCH] mm/filemap_xip.c compilation fix mm/filemap_xip.c: In function `__xip_unmap': mm/filemap_xip.c:194: request for member `pte' in something not a structure or union Apparently pte_pfn() takes a pte_t, not a pointer to a pte_t. From looking at asm/page.h, it seems to be the same on ia32 or ppc (iff STRICT_MM_TYPECHECKS is enabled, which is disabled by default on ppc). Acked-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-12 16:01:00 -07:00
Alexey Dobriyan	0db925af1d	[PATCH] propagate __nocast annotations Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-07 18:23:46 -07:00
Anton Blanchard	42639269f9	[PATCH] mm: quieten OOM killer noise We now print statistics when invoking the OOM killer, however this information is not rate limited and you can get into situations where the console is continually spammed. For example, when a task is exiting the OOM killer will simply return (waiting for that task to exit and clear up memory). If the VM continually calls back into the OOM killer we get thousands of copies of show_mem() on the console. Use printk_ratelimit() to quieten it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-07 18:23:36 -07:00
Marcelo Tosatti	37b173a4d0	[PATCH] remove completely bogus comment inside __alloc_pages() try_to_free_pages handling Remove completely bogus comment from did_some_progress != 0 handling (that same comment is a few lines below on did_some_progress = 0 case, where it belongs). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-07 18:23:35 -07:00
Marcelo Tosatti	79b9ce311e	[PATCH] print order information when OOM killing Dump the current allocation order when OOM killing. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-07 18:23:35 -07:00
Christoph Lameter	83b78bd2d3	[PATCH] Fix broken kmalloc_node in rc1/rc2 This patch used to be in Andrew's tree before the NUMA slab allocator went in. Either this patch or the NUMA slab allocator is needed in order for kmalloc_node to work correctly. pcibus_to_node may be used to generate the node information passed to kmalloc_node. pcibus_to_node returns -1 if it was not able to determine on which node a pcibus is located. For that case kmalloc_node must work like kmalloc. Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-07-06 10:52:45 -07:00
Pekka J Enberg	687a21cee1	[PATCH] rename wakeup_bdflush to wakeup_pdflush Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-28 21:20:31 -07:00
Bob Picco	3212c6be25	[PATCH] fix WANT_PAGE_VIRTUAL in memmap_init I spotted this issue while in memmap_init last week. I can't say the change has any test coverage by me. start_pfn was formerly used in main "for" loop. The fix is replace start_pfn with pfn. Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-27 15:11:42 -07:00
Linus Torvalds	2031d0f586	Merge Christoph's freeze cleanup patch	2005-06-25 17:16:53 -07:00
Christoph Lameter	3e1d1d28d9	[PATCH] Cleanup patch for process freezing 1. Establish a simple API for process freezing defined in linux/include/sched.h: frozen(process) Check for frozen process freezing(process) Check if a process is being frozen freeze(process) Tell a process to freeze (go to refrigerator) thaw_process(process) Restart process frozen_process(process) Process is frozen now 2. Remove all references to PF_FREEZE and PF_FROZEN from all kernel sources except sched.h 3. Fix numerous locations where try_to_freeze is manually done by a driver 4. Remove the argument that is no longer necessary from two function calls. 5. Some whitespace cleanup 6. Clear potential race in refrigerator (provides an open window of PF_FREEZE cleared before setting PF_FROZEN, recalc_sigpending does not check PF_FROZEN). This patch does not address the problem of freeze_processes() violating the rule that a task may only modify its own flags by setting PF_FREEZE. This is not clean in an SMP environment. freeze(process) is therefore not SMP safe! Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-25 17:10:13 -07:00
Nick Wilson	8c0e33c133	[PATCH] Use ALIGN to remove duplicate code This patch makes use of ALIGN() to remove duplicate round-up code. Signed-off-by: Nick Wilson <njw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-25 16:25:02 -07:00
Vivek Goyal	92aa63a5a1	[PATCH] kdump: Retrieve saved max pfn This patch retrieves the max_pfn being used by previous kernel and stores it in a safe location (saved_max_pfn) before it is overwritten due to user defined memory map. This pfn is used to make sure that user does not try to read the physical memory beyond saved_max_pfn. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-25 16:24:52 -07:00
Badari Pulavarty	b0cfbd995d	[PATCH] fix for generic_file_write iov problem Here is the fix for the problem described in http://bugzilla.kernel.org/show_bug.cgi?id=4721 Basically, problem is generic_file_buffered_write() is accessing beyond end of the iov[] vector after handling the last vector. If we happen to cross page boundary, we get a fault. I think this simple patch is good enough. If we really don't want to depend on the "count", then we need pass nr_segs to filemap_set_next_iovec() and decrement it and check it. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-25 16:24:39 -07:00
Pavel Machek	648be31881	[PATCH] swsusp: kill config_pm_disk CONFIG_PM_DISK is long gone, but it still managed to survive in a few places. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-25 16:24:32 -07:00
Hugh Dickins	2d15cab85b	[PATCH] mm: fix remap_pte_range BUG Out-of-tree user of remap_pfn_range hit kernel BUG at mm/memory.c:1112! It passes an unrounded size to remap_pfn_range, which was okay before 2.6.12, but misses remap_pte_range's new end condition. An audit of all the other ptwalks confirms that this is the only one so exposed. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-25 16:24:26 -07:00
Hifumi Hisashi	1e8a81c5a3	[PATCH] Fix the error handling in direct I/O Fix a bug on error handling in the direct I/O function. Currently, if a file is opened with the O_DIRECT|O_SYNC flag, the write() syscall cannot receive the EIO error after an I/O error (SCSI cable is disconnected etc.). Return values of other points that call generic_osync_inode() are treated appropriately. Signed-off-by: Hisashi Hifumi <hifumi.hisashi@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-25 16:24:25 -07:00
Carsten Otte	fe77ba6f4f	[PATCH] xip: madvice/fadvice: execute in place Make sys_madvice/fadvice return sane with xip. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-24 00:06:42 -07:00
Carsten Otte	eb6fe0c388	[PATCH] xip: reduce code duplication This patch reworks filemap_xip.c with the goal to reduce code duplication from mm/filemap.c. It applies against 2.6.12-rc6-mm1. Instead of implementing the aio functions, this one implements the synchronous read/write functions only. For readv and writev, the generic fallback is used. For aio, we rely on the application doing the fallback. Since our "synchronous" function does memcpy immediately anyway, there is no performance difference between using the fallbacks or implementing each operation. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-24 00:06:41 -07:00
Carsten Otte	ceffc07852	[PATCH] xip: fs/mm: execute in place - generic_file* file operations do no longer have a xip/non-xip split - filemap_xip.c implements a new set of fops that require get_xip_page aop to work proper. all new fops are exported GPL-only (don't like to see whatever code use those except GPL modules) - __xip_unmap now uses page_check_address, which is no longer static in rmap.c, and defined in linux/rmap.h - mm/filemap.h is now much more clean, plainly having just Linus' inline funcs moved here from filemap.c - fix includes in filemap_xip to make it build cleanly on i386 Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-24 00:06:41 -07:00
Martin Waitz	3d41088fa3	[PATCH] DocBook: update comments This patch updates some comments to match code changes. Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-24 00:06:40 -07:00
Christoph Lameter	45778ca819	[PATCH] Remove f_error field from struct file The following patch removes the f_error field and all checks of f_error. Trond said: f_error was introduced for NFS, and made sense when we were guaranteed always to have a file pointer around when write errors occurred. Since then, we have (for various reasons) had to introduce the nfs_open_context in order to track the file read/write state, and it made sense to move our f_error tracking there too. Signed-off-by: Christoph Lameter <christoph@lameter.com> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:33 -07:00
Benjamin LaHaise	01890a4c12	[PATCH] mempool - only init waitqueue in slow path Here's a small patch to improve the performance of mempool_alloc by only initializing the wait queue when we're about to wait. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:29 -07:00
Pekka Enberg	3bc1ee3e8f	[PATCH] remove redundant vm_flags clearing from madvise.c This patch removes redundant VM_ClearReadHint from mm/madvise.c which was left there by Prasanna's patch. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:19 -07:00
Paulo Marques	543537bd92	[PATCH] create a kstrdup library function This patch creates a new kstrdup library function and changes the "local" implementations in several places to use this function. Most of the changes come from the sound and net subsystems. The sound part had already been acknowledged by Takashi Iwai and the net part by David S. Miller. I left UML alone for now because I would need more time to read the code carefully before making changes there. Signed-off-by: Paulo Marques <pmarques@grupopie.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:18 -07:00
Christoph Lameter	1946089a10	[PATCH] NUMA aware block device control structure allocation Patch to allocate the control structures for ide devices on the node of the device itself (for NUMA systems). The patch depends on the Slab API change patch by Manfred and me (in mm) and the pcidev_to_node patch that I posted today. Does some realignment too. Signed-off-by: Justin M. Forbes <jmforbes@linuxtx.org> Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Pravin Shelar <pravin@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:09 -07:00
Andy Whitcroft	29751f6991	[PATCH] sparsemem hotplug base Make sparse's initialization be accessible at runtime. This allows sparse mappings to be created after boot in a hotplug situation. This patch is separated from the previous one just to give an indication how much of the sparse infrastructure is just for hotplug memory. The section_mem_map doesn't really store a pointer. It stores something that is convenient to do some math against to get a pointer. It isn't valid to just do section_mem_map, so I don't think it should be stored as a pointer. There are a couple of things I'd like to store about a section. First of all, the fact that it is !NULL does not mean that it is present. There could be such a combination where section_mem_map *is* NULL, but the math gets you properly to a real mem_map. So, I don't think that check is safe. Since we're storing 32-bit-aligned structures, we have a few bits in the bottom of the pointer to play with. Use one bit to encode whether there's really a mem_map there, and the other one to tell whether there's a valid section there. We need to distinguish between the two because sometimes there's a gap between when a section is discovered to be present and when we can get the mem_map for it. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:05 -07:00
Andy Whitcroft	641c767389	[PATCH] sparsemem swiss cheese numa layouts The part of the sparsemem patch which modifies memmap_init_zone() has recently become a problem. It changes behavior so that there is a call to pfn_to_page() for each individual page inside of a node's range: node_start_pfn through node_end_pfn. It used to simply do this once, at the beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside of a node made it necessary to change. Mike Kravetz recently wrote a patch which made the NUMA code accept some new kinds of layouts. The system's memory was laid out like this, with node 0's memory in two pieces: one before and one after node 1's memory: Node 0: +++++ +++++ Node 1: +++++ Previous behavior before Mike's patch was to assign nodes like this: Node 0: 00000 XXXXX Node 1: 11111 Where the 'X' areas were simply thrown away. The new behavior was to make the pg_data_t span node 0 across all of its areas, including areas that are really node 1's: Node 0: 000000000000000 Node 1: 11111 This wastes a little bit of mem_map space, but ends up being OK, and more fully utilizes the system's memory. memmap_init_zone() initializes all of the "struct page"s for node 0, even for the "hole", but those never get used, because there is no pfn_to_page() that resolves to those pages. However, only calling pfn_to_page() once, memmap_init_zone() always uses the pages that were allocated for node0->node_mem_map because: struct page *start = pfn_to_page(start_pfn); // effectively start = &node->node_mem_map[0] for (page = start; page < (start + size); page++) { init_page_here();... page++; } Slow, and wasteful, but generally harmless. But, modify that to call pfn_to_page() for each loop iteration (like sparsemem does): for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) { page = pfn_to_page(pfn); } And you end up trying to initialize node 1's pages too early, along with bogus data from node 0. 
This patch checks for those weird layouts and declines to touch the pages, making the more frequent pfn_to_page() calls OK to do. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:05 -07:00
Andy Whitcroft	d41dee369b	[PATCH] sparsemem memory model Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of mem_map[] is needed by discontiguous memory machines (like in the old CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually become a complete replacement. A significant advantage over DISCONTIGMEM is that it's completely separated from CONFIG_NUMA. When producing this patch, it became apparent in that NUMA and DISCONTIG are often confused. Another advantage is that sparse doesn't require each NUMA node's ranges to be contiguous. It can handle overlapping ranges between nodes with no problems, where DISCONTIGMEM currently throws away that memory. Sparsemem uses an array to provide different pfn_to_page() translations for each SECTION_SIZE area of physical memory. This is what allows the mem_map[] to be chopped up. In order to do quick pfn_to_page() operations, the section number of the page is encoded in page->flags. Part of the sparsemem infrastructure enables sharing of these bits more dynamically (at compile-time) between the page_zone() and sparsemem operations. However, on 32-bit architectures, the number of bits is quite limited, and may require growing the size of the page->flags type in certain conditions. Several things might force this to occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of memory), an increase in the physical address space, or an increase in the number of used page->flags. One thing to note is that, once sparsemem is present, the NUMA node information no longer needs to be stored in the page->flags. It might provide speed increases on certain platforms and will be stored there if there is room. But, if out of room, an alternate (theoretically slower) mechanism is used. This patch introduces CONFIG_FLATMEM. 
It is used in almost all cases where there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM often have to compile out the same areas of code. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Martin Bligh <mbligh@aracnet.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:04 -07:00
Andy Whitcroft	af705362ab	[PATCH] generify memory present Allow architectures to indicate that they will be providing hooks to indicate installed memory areas, memory_present(). Provide prototypes for the i386 implementation. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Martin Bligh <mbligh@aracnet.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:04 -07:00
Dave Hansen	785dcd44b6	[PATCH] mm/Kconfig: give DISCONTIG more help text This gives DISCONTIGMEM a bit more help text to explain what it does, not just when to choose it. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:03 -07:00
Dave Hansen	e1785e85b9	[PATCH] mm/Kconfig: hide "Memory Model" selection menu I got some feedback from users who think that the new "Memory Model" menu is a little invasive. This patch will hide that menu, except when CONFIG_EXPERIMENTAL is enabled or when an individual architecture wants it. An individual arch may want to enable it because they've removed their arch-specific DISCONTIG prompt in favor of the mm/Kconfig one. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:03 -07:00
Dave Hansen	44d0f805c7	[PATCH] sparsemem: fix minor "defaults" issue in mm/Kconfig The following patch applies on top of 2.6.12-rc2-mm1. It fixes a minor user interaction issue, and an early reference to SPARSEMEM. This "choice" menu would always default to FLATMEM, as it was listed first. Move it to the end so that the other defaults have a chance first. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:03 -07:00
Dave Hansen	93b7504e3e	[PATCH] Introduce new Kconfig option for NUMA or DISCONTIG There is some confusion that arose when working on SPARSEMEM patch between what is needed for DISCONTIG vs. NUMA. Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently. All of the current NUMA implementations require an implementation of DISCONTIG. Because of this, quite a lot of code which is really needed for NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n case. Introducing this new NEED_MULTIPLE_NODES config option allows code that is needed for both NUMA or DISCONTIG to be separated out from code that is specific to DISCONTIG. One great advantage of this approach is that it doesn't require every architecture to be converted over. All of the current implementations should "just work", only the ones implementing SPARSEMEM will have to be fixed up. The change to free_area_init() makes it work inside, or out of the new config option. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:03 -07:00
Dave Hansen	3a9da7655d	[PATCH] create mm/Kconfig for arch-independent memory options With sparsemem being introduced, we need a central place for new memory-related .config options: mm/Kconfig. This allows us to remove many of the duplicated arch-specific options. The new option, CONFIG_FLATMEM, is there to enable us to detangle NUMA and DISCONTIGMEM. This is a requirement for sparsemem because sparsemem uses the NUMA code without the presence of DISCONTIGMEM. The sparsemem patches use CONFIG_FLATMEM in generic code, so this patch is a requirement before applying them. Almost all places that used to do '#ifndef CONFIG_DISCONTIGMEM' should use '#ifdef CONFIG_FLATMEM' instead. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:01 -07:00
Dave Hansen	348f8b6c48	[PATCH] sparsemem base: reorganize page->flags bit operations Generify the value fields in the page_flags. The aim is to allow the location and size of these fields to be varied. Additionally we want to move away from fixed allocations per field whilst still enforcing the overall bit utilisation limits. We rely on the compiler to spot and optimise the accessor functions. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:01 -07:00
Dave Hansen	6f167ec721	[PATCH] sparsemem base: simple NUMA remap space allocator Introduce a simple allocator for the NUMA remap space. This space is very scarce, used for structures which are best allocated node local. This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep the pgdat->node_mem_map initialized in a consistent place for all architectures. Issues: o alloc_remap takes a node_id where we might expect a pgdat which was intended to allow us to allocate the pgdat's using this mechanism; which we do not yet do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:01 -07:00
Christoph Lameter	b7c84c6ada	[PATCH] boot_pageset must not be freed. The boot_pageset needs to be preserved for hotplugging and for off line processors and nodes. Otherwise pointers will point into memory that has now a different use. /proc/zoneinfo is currently showing strange results if processors / nodes are not present. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-22 20:42:32 -07:00
Denis Vlasenko	c0d62219a4	[PATCH] Kill stray newline OOM killer prints a stray newline. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:21 -07:00
Abhijit Karmarkar	b4955ce3dd	[PATCH] msync: check pte dirty earlier It's common practice to msync a large address range regularly, in which often only a few ptes have actually been dirtied since the previous pass. sync_pte_range then goes much faster if it tests whether pte is dirty before locating and accessing each struct page cacheline; and it is hardly slowed by ptep_clear_flush_dirty repeating that test in the opposite case, when every pte actually is dirty. But beware, s390's pte_dirty always says false, since its dirty bit is kept in the storage key, located via the struct page address. So skip this optimization in its case: use a pte_maybe_dirty macro which just says true if page_test_and_clear_dirty is implemented. Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:21 -07:00
Hugh Dickins	c475a8ab62	[PATCH] can_share_swap_page: use page_mapcount Remember that ironic get_user_pages race? when the raised page_count on a page swapped out led do_wp_page to decide that it had to copy on write, so substituted a different page into userspace. 2.6.7 onwards have Andrea's solution, where try_to_unmap_one backs out if it finds page_count raised. Which works, but is unsatisfying (rmap.c has no other page_count heuristics), and was found a few months ago to hang an intensive page migration test. A year ago I was hesitant to engage page_mapcount, now it seems the right fix. So remove the page_count hack from try_to_unmap_one; and use activate_page in unuse_mm when dropping lock, to replace its secondary effect of helping swapoff to make progress in that case. Simplify can_share_swap_page (now called only on anonymous pages) to check page_mapcount + page_swapcount == 1: still needs the page lock to stabilize their (pessimistic) sum, but does not need swapper_space.tree_lock for that. In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to keep sum on the high side, and correct when can_share_swap_page called. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:21 -07:00
Hugh Dickins	d296e9cd02	[PATCH] do_wp_page: cannot share file page A small optimization to do_wp_page's check for whether to avoid copy by reusing the page already mapped. It can never share a cached file page, nor can it share a reserved page (often the empty zero page), so it's a waste of time to lock and unlock in those cases. Which nowadays can both be neatly excluded by a preliminary PageAnon test. Christoph has reported that a preliminary page_count test proved valuable for scalability here, but PageAnon covers more common cases all at once. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:21 -07:00
Hugh Dickins	08ef472937	[PATCH] get_user_pages: kill get_page_map Since its birth, get_user_pages has been calling a misguided get_page_map function. follow_page has already returned NULL if the pfn is invalid, we cannot reach an invalid pfn from a validated struct page. Remove get_page_map, and the messy rewind in get_user_pages to cope with its failure. Oh, and could we please call that "struct page *page" like everywhere else, instead of "struct page *map"? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:21 -07:00
Hugh Dickins	334795eca4	[PATCH] bad_page: clear reclaim and slab Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page ought to clear them to avoid repetitive reports (Nikita noticed this too). Let prep_new_page check page_count and PG_slab as free_pages_check does. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:19 -07:00
Hugh Dickins	91612e0df2	[PATCH] mbind: check_range use standard ptwalk Strict mbind's check for currently mapped pages being on node has been using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry: replace that by a standard four-level page table walk like others in mm. Since mmap_sem is held for writing, page_table_lock can be taken at the inner level to limit latency. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:19 -07:00
Hugh Dickins	941150a326	[PATCH] mbind: fix verify_pages pte_page Strict mbind's check that pages already mapped are on right node has been using pte_page without checking if pfn_valid, and without page_table_lock to prevent spurious failures when try_to_unmap_one intervenes between the pte_present and the pte_page. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:19 -07:00
Hugh Dickins	0edd73b334	[PATCH] shmem: restore superblock info To improve shmem scalability, we allowed tmpfs instances which don't need their blocks or inodes limited not to count them, and not to allocate any sbinfo. Which was okay when the only use for the sbinfo was accounting blocks and inodes; but since then a couple of unrelated projects extending tmpfs want to store other data in the sbinfo. Whether either extension reaches mainline is beside the point: I'm guilty of a bad design decision, and should restore sbinfo to make any such future extensions easier. So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited inodes. Brent Casavant verified (many months ago) that this does not perceptibly impact the scalability (since the unlimited sbinfo cacheline is repeatedly accessed but only once dirtied). And merge shmem_set_size into its sole caller shmem_remount_fs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:18 -07:00
Christoph Lameter	2caaad41e4	[PATCH] Reduce size of huge boot per_cpu_pageset Reduce size of the huge per_cpu_pageset structure in __initdata introduced into mm1 with the pageset localization patchset. Use one specially configured pageset per cpu for all zones and nodes during bootup. - Avoid duplication of pageset initialization code. - do the adding to the pageset list before potential free_pages_bulk in free_hot_cold_page (otherwise we would have to hold a page in a pageset during the period that the boot pagesets are in use). - remove mistaken __cpuinitdata attribute and revert back to __initdata for the boot pageset. A boot pageset is not necessary for cpu hotplug. Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of sparsemem patches) Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:18 -07:00
Christoph Lameter	4ae7c03943	[PATCH] Periodically drain non local pagesets The pageset array can potentially acquire a huge amount of memory on large NUMA systems. F.e. on a system with 512 processors and 256 nodes there will be 256*512 pagesets. If each pageset only holds 5 pages then we are talking about 655360 pages. With a 16K page size on IA64 this results in potentially 10 Gigabytes of memory being trapped in pagesets. The typical cases are much less for smaller systems but there is still the potential of memory being trapped in off node pagesets. Off node memory may be rarely used if local memory is available and so we may potentially have memory in seldom used pagesets without this patch. The slab allocator flushes its per cpu caches every 2 seconds. The following patch flushes the off node pageset caches in the same way by tying into the slab flush. The patch also changes /proc/zoneinfo to include the number of pages currently in each pageset. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:18 -07:00
Janet Morgan	578c2fd6a7	[PATCH] add OOM debug This patch provides more debug info when the system is OOM. It displays memory stats (basically sysrq-m info) from __alloc_pages() when page allocation fails and during OOM kill. Thanks to Dave Jones for coming up with the idea. Signed-off-by: Janet Morgan <janetmor@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:17 -07:00
Benjamin LaHaise	c2f29ea111	[PATCH] __read_page_state(): pass unsigned long instead of unsigned By making the offset argument of __read_page_state an unsigned long instead of unsigned, we can avoid forcing the compiler to sign extend a usually constant argument. This saves 1 instruction on x86-64. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:17 -07:00
Benjamin LaHaise	83e5d8f725	[PATCH] __mod_page_state(): pass unsigned long instead of unsigned By making the offset argument of __mod_page_state an unsigned long instead of unsigned, we can avoid forcing the compiler to sign extend a usually constant argument. This saves 1 instruction on x86-64. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:17 -07:00
Darren Hart	1ad539b2bd	[PATCH] vm: try_to_free_pages unused argument try_to_free_pages accepts a third argument, order, but hasn't used it since before 2.6.0. The following patch removes the argument and updates all the calls to try_to_free_pages. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:17 -07:00
Chris Wright	73219d1780	[PATCH] mmap topdown fix for large stack limit, large allocation The topdown changes in 2.6.12-rc1 can cause large allocations with large stack limit to fail, despite there being space available. The mmap_base-len is only valid when len <= mmap_base. However, nothing in topdown allocator checks this. It's only (now) caught at higher level, which will cause allocation to simply fail. The following change restores the fallback to bottom-up path, which will allow large allocations with large stack limit to potentially still succeed. Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-21 18:46:16 -07:00
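
The topdown fix in 73219d1780 comes down to an unsigned-subtraction guard: mmap_base - len is only meaningful when len does not exceed mmap_base, and otherwise the search should fall back to the bottom-up path instead of failing. The following is a minimal sketch of that idea, not the kernel's code. The names topdown_search, bottomup_search and NO_ADDR are hypothetical, and existing mappings are ignored entirely.

```c
/* Hypothetical sentinel for "no address found"; not a kernel constant. */
#define NO_ADDR ((unsigned long)-1)

/*
 * Hypothetical bottom-up search: with no existing mappings modelled,
 * the lowest candidate is low_base, provided the range fits below task_size.
 */
static unsigned long bottomup_search(unsigned long low_base,
                                     unsigned long len,
                                     unsigned long task_size)
{
    if (len > task_size || low_base > task_size - len)
        return NO_ADDR;
    return low_base;
}

/*
 * Hypothetical top-down search: place the mapping just below mmap_base.
 * The subtraction is unsigned, so it would wrap around if len > mmap_base;
 * in that case fall back to the bottom-up search rather than failing.
 */
static unsigned long topdown_search(unsigned long mmap_base,
                                    unsigned long low_base,
                                    unsigned long len,
                                    unsigned long task_size)
{
    if (len <= mmap_base)
        return mmap_base - len;   /* safe: cannot wrap */
    return bottomup_search(low_base, len, task_size);
}
```

With a large stack limit mmap_base sits low, so a request larger than mmap_base takes the fallback branch and can still succeed if the range fits below task_size.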

303 Commits