linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-15 07:33:56 +00:00

Mel Gorman 072bb0aa5e mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

When a user or administrator requires swap for their application, they create a swap partition or file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks, and thin clients.

The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download

There is also documentation and tutorials on how to set up swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP

The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over NFS, which at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer, which is needed for both NBD and NFS.

Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks.

Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC.

Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing.

Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required.

Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patches 7-12 allow network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped.

Patch 11 reintroduces __skb_alloc_page, which the networking folk may object to but is needed in some cases to propagate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail.

Patch 13 is a micro-optimisation to avoid a function call in the common case.

Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary.

Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up.

Patch 16 adds a statistic to track how often processes get throttled.

Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them was expected to use the slab allocators reasonably heavily but there did not appear to be significant performance variances.
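The throttling rule described for Patch 15 can be illustrated with a small user-space sketch. This is not the kernel's code: `struct reserve_model` and `should_throttle()` are hypothetical names invented for illustration, and the only thing taken from the text is the 50% threshold.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified model of the PFMEMALLOC reserve.
 * The real kernel tracks this per zone via watermarks; the
 * struct and field names here are assumptions for illustration.
 */
struct reserve_model {
	unsigned long reserve_pages;	/* total size of the reserve */
	unsigned long free_pages;	/* reserve pages still free */
};

/*
 * Return true when a direct reclaimer should be put to sleep:
 * at least half of the reserve has been consumed, i.e. no more
 * than half of it is still free.
 */
static bool should_throttle(const struct reserve_model *r)
{
	return 2 * r->free_pages <= r->reserve_pages;
}
```

In this model a caller would sleep while `should_throttle()` returns true and be woken once reclaim has freed enough pages, which mirrors the waitqueue behaviour the message describes only loosely.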
For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under memory pressure.

Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel runs to completion. However, the patched kernel completed the test 45% faster.

MICRO
                                          3.5.0-rc2  3.5.0-rc2
                                            vanilla    swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)              197.80     173.07
User+Sys Time Running Test (seconds)         206.96     182.03
Total Elapsed Time (seconds)                3240.70    1762.09

This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were allocated from the PFMEMALLOC reserves.

When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to be able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags.
This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages.

[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:45 -07:00
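The access policy that the commit message describes for slab pages can be sketched as a user-space model. The struct and function names below are hypothetical and do not appear in the kernel. Only the rule itself comes from the text: a caller may use the reserves if it has PF_MEMALLOC or TIF_MEMDIE set and is not processing an interrupt, and objects on a page with page->pfmemalloc set are kept for such callers.

```c
#include <stdbool.h>

/* Hypothetical model of the caller state the text refers to. */
struct caller_model {
	bool pf_memalloc;	/* stands in for PF_MEMALLOC */
	bool tif_memdie;	/* stands in for TIF_MEMDIE */
	bool in_interrupt;	/* caller is processing an interrupt */
};

/* Hypothetical model of a slab page; mirrors page->pfmemalloc. */
struct slab_page_model {
	bool pfmemalloc;	/* page came from the PFMEMALLOC reserves */
};

/* Rule from the text: flag set and not in interrupt context. */
static bool may_use_reserves(const struct caller_model *c)
{
	return !c->in_interrupt && (c->pf_memalloc || c->tif_memdie);
}

/*
 * Objects on ordinary pages are available to everyone; objects on
 * reserve-backed pages are kept for callers entitled to the reserves.
 */
static bool may_take_object(const struct slab_page_model *p,
			    const struct caller_model *c)
{
	return !p->pfmemalloc || may_use_reserves(c);
}
```

A caller for which `may_take_object()` returns false would, in the SLAB behaviour described above, try to allocate a fresh page instead of consuming the reserve-backed one. The sketch leaves out the __GFP_MEMALLOC flag and the softirq case that other patches in the series introduce.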
..
backing-dev.c	mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads	2012-07-31 18:42:40 -07:00
bootmem.c	bootmem: make ___alloc_bootmem_node_nopanic() really nopanic	2012-07-17 16:21:29 -07:00
bounce.c	bounce: allow use of bounce pool via config option	2012-07-18 16:40:35 -04:00
cleancache.c	->encode_fh() API change	2012-05-29 23:28:33 -04:00
compaction.c	mm: have order > 0 compaction start off where it left	2012-07-31 18:42:43 -07:00
debug-pagealloc.c	mm, x86: Remove debug_pagealloc_enabled	2011-12-06 09:24:07 +01:00
dmapool.c	mm: fix implicit stat.h usage in dmapool.c	2011-10-31 09:20:12 -04:00
fadvise.c	mm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise()	2012-07-31 18:42:42 -07:00
failslab.c	switch debugfs to umode_t	2012-01-03 22:54:56 -05:00
filemap_xip.c	fs: introduce inode operation ->update_time	2012-06-01 12:07:25 -04:00
filemap.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2012-06-01 10:34:35 -07:00
fremap.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
frontswap.c	mm/frontswap: cleanup doc and comment error	2012-07-23 11:16:20 -04:00
highmem.c	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2011-11-06 19:44:47 -08:00
huge_memory.c	mm/memcg: apply add/del_page to lruvec	2012-05-29 16:22:28 -07:00
hugetlb_cgroup.c	hugetlb/cgroup: remove exclude and wakeup rmdir calls from migrate	2012-07-31 18:42:41 -07:00
hugetlb.c	hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.	2012-07-31 18:42:41 -07:00
hwpoison-inject.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
init-mm.c	atomic: use <linux/atomic.h>	2011-07-26 16:49:47 -07:00
internal.h	mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages	2012-07-31 18:42:45 -07:00
Kconfig	mm: factor out memory isolate functions	2012-07-31 18:42:45 -07:00
Kconfig.debug	mm: more intensive memory corruption debugging	2012-01-10 16:30:42 -08:00
kmemcheck.c
kmemleak-test.c	kmemleak: remove memset by using kzalloc	2011-01-27 18:31:51 +00:00
kmemleak.c	kmemleak: Disable early logging when kmemleak is off by default	2012-01-20 16:57:05 +00:00
ksm.c	ksm: cleanup: introduce find_mergeable_vma()	2012-03-21 17:54:59 -07:00
maccess.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
madvise.c	mm: Hold a file reference in madvise_remove	2012-07-06 10:34:38 -07:00
Makefile	mm: factor out memory isolate functions	2012-07-31 18:42:45 -07:00
memblock.c	mm/memblock.c:memblock_double_array(): cosmetic cleanups	2012-07-31 18:42:41 -07:00
memcontrol.c	mm, memcg: move all oom handling to memcontrol.c	2012-07-31 18:42:45 -07:00
memory_hotplug.c	mm/hotplug: free zone->pageset when a zone becomes empty	2012-07-31 18:42:44 -07:00
memory-failure.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
memory.c	mm/memory.c:print_vma_addr(): call up_read(&mm->mmap_sem) directly	2012-07-31 18:42:43 -07:00
mempolicy.c	Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux	2012-07-30 11:32:24 -07:00
mempool.c	mempool: fix first round failure behavior	2012-01-10 16:30:45 -08:00
migrate.c	hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration	2012-07-31 18:42:41 -07:00
mincore.c	mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode	2012-03-21 17:54:54 -07:00
mlock.c	vm: avoid using find_vma_prev() unnecessarily	2012-03-06 18:23:36 -08:00
mm_init.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
mmap.c	mm: account the total_vm in the vm_stat_account()	2012-07-31 18:42:39 -07:00
mmu_context.c	mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat()	2012-03-21 17:54:59 -07:00
mmu_notifier.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
mmzone.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
mprotect.c	Merge branch 'akpm' (Andrew's patch-bomb)	2012-03-22 09:04:48 -07:00
mremap.c	mm: account the total_vm in the vm_stat_account()	2012-07-31 18:42:39 -07:00
msync.c
nobootmem.c	memblock: free allocated memblock_reserved_regions later	2012-07-11 16:04:50 -07:00
nommu.c	nommu: fix compilation of nommu.c	2012-06-04 17:17:31 -04:00
oom_kill.c	mm, memcg: move all oom handling to memcontrol.c	2012-07-31 18:42:45 -07:00
page_alloc.c	mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages	2012-07-31 18:42:45 -07:00
page_cgroup.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
page_io.c	frontswap: s/put_page/store/g s/get_page/load	2012-05-15 11:34:08 -04:00
page_isolation.c	memory-hotplug: fix kswapd looping forever problem	2012-07-31 18:42:45 -07:00
page-writeback.c	writeback: Fix some comment errors	2012-06-09 19:54:47 +08:00
pagewalk.c	mm: fix kernel-doc warnings	2012-06-20 14:39:36 -07:00
percpu-km.c
percpu-vm.c	mm: fix kernel-doc warnings	2012-06-20 14:39:36 -07:00
percpu.c	kmemleak: Fix the kmemleak tracking of the percpu areas with !SMP	2012-05-09 10:13:29 -07:00
pgtable-generic.c	arch/tile: allow building Linux with transparent huge pages enabled	2012-05-25 12:48:21 -04:00
prio_tree.c	sanitize <linux/prefetch.h> usage	2011-05-20 12:50:29 -07:00
process_vm_access.c	aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()	2012-05-31 17:49:32 -07:00
quicklist.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
readahead.c	mm: move readahead syscall to mm/readahead.c	2012-05-29 16:22:23 -07:00
rmap.c	mm: remove swap token code	2012-05-29 16:22:19 -07:00
shmem.c	don't pass nameidata to ->create()	2012-07-14 16:34:47 +04:00
slab_common.c	mm: Fix build warning in kmem_cache_create()	2012-07-30 13:15:40 +03:00
slab.c	mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages	2012-07-31 18:42:45 -07:00
slab.h	mm, sl[aou]b: Use a common mutex definition	2012-07-09 12:13:41 +03:00
slob.c	slob: Fix early boot kernel crash	2012-07-12 10:13:22 +03:00
slub.c	mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages	2012-07-31 18:42:45 -07:00
sparse-vmemmap.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
sparse.c	mm: setup pageblock_order before it's used by sparsemem	2012-07-31 18:42:43 -07:00
swap_state.c	swap: allow swap readahead to be merged	2012-07-31 18:42:39 -07:00
swap.c	mm/memcg: apply add/del_page to lruvec	2012-05-29 16:22:28 -07:00
swapfile.c	swap: fix shmem swapping when more than 8 areas	2012-06-15 21:48:14 -07:00
truncate.c	mm/fs: remove truncate_range	2012-05-29 16:22:23 -07:00
util.c	new helper: vm_mmap_pgoff()	2012-06-01 10:37:18 -04:00
vmalloc.c	mm: make vb_alloc() more foolproof	2012-07-31 18:42:39 -07:00
vmscan.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
vmstat.c	mm/vmstat.c: remove debug fs entries on failure of file creation and made extfrag_debug_root dentry local	2012-05-29 16:22:19 -07:00