linux/Documentation
Mel Gorman 90afa5de6f vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim
A bug was brought to my attention against a distro kernel but it affects
mainline and I believe problems like this have been reported in various
guises on the mailing lists although I don't have specific examples at the
moment.

The reported problem was that malloc() stalled for a long time (minutes in
some cases) if a large tmpfs mount was occupying a large percentage of
memory overall.  The pages did not get cleaned or reclaimed by
zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
are uselessly scanned frequencly making the CPU spin at near 100%.

This patchset intends to address that bug and bring the behaviour of
zone_reclaim() more in line with expectations which were noticed during
investigation.  It is based on top of mmotm and takes advantage of
Kosaki's work with respect to zone_reclaim().

Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
	scan should go ahead. The broken heuristic is what was causing the
	malloc() stall as it uselessly scanned the LRU constantly. Currently,
	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
	could not deal with tmpfs pages at all. This fixes up the heuristic so
	that an unnecessary scan is more likely to be correctly avoided.

Patch 2 notes that zone_reclaim() returning a failure automatically means
	the zone is marked full. This is not always true. It could have
	failed because the GFP mask or zone_reclaim_mode were unsuitable.

Patch 3 introduces a counter zreclaim_failed that will increment each
	time the zone_reclaim scan-avoidance heuristics fail. If that
	counter is rapidly increasing, then zone_reclaim_mode should be
	set to 0 as a temporarily resolution and a bug reported because
	the scan-avoidance heuristic is still broken.

This patch:

On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim.  On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.

There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is
basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
proper detection can manfiest as high CPU usage as the LRU list is scanned
uselessly.

Historically, once enabled it was depending on NR_FILE_PAGES which may
include swapcache pages that the reclaim_mode cannot deal with.  Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed such as swapcache and made a calculation
based on the inactive, active and mapped files.  This is far superior when
zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.

This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
the reclaim_mode it will either consider NR_FILE_PAGES as potential
candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
not set, then NR_FILE_MAPPED are not.

[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:45 -07:00
..
ABI Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block 2009-06-11 11:10:35 -07:00
accounting Documentation/accounting/getdelays.c: fix endless loop 2009-01-15 16:39:37 -08:00
acpi ACPI: update debug parameter documentation 2008-11-07 21:45:29 -05:00
aoe aoe: user can ask driver to forget previously detected devices 2008-02-08 09:22:31 -08:00
arm [ARM] S3C24XX: GPIO: Change to macros for GPIO numbering 2009-05-18 16:26:03 +01:00
auxdisplay .gitignore updates 2008-10-30 11:38:45 -07:00
blackfin Blackfin arch: Add document about bfin-gpio 2009-01-07 23:14:38 +08:00
block trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
blockdev mflash: initial support 2009-04-07 08:12:38 +02:00
cdrom doc/cdrom: Trvial documentation error, file not present 2008-10-10 08:22:44 +02:00
cgroups memcg: fix documentation 2009-04-13 15:04:33 -07:00
connector Documentation/connector/cn_test.c: don't use gfp_any() 2009-02-12 16:47:01 -08:00
console Typo: fro -> from 2007-07-19 10:04:47 -07:00
cpu-freq [CPUFREQ] ondemand/conservative: sanitize sampling_rate restrictions 2009-02-24 22:47:31 -05:00
cpuidle cpuidle: Add Documentation 2008-02-14 00:16:13 -05:00
cris fix random typos 2008-10-16 11:21:30 -07:00
crypto async_tx, dmaengine: document channel allocation and api rework 2009-01-05 18:10:19 -07:00
development-process docs: Encourage better changelogs in the development process document 2009-06-04 10:32:49 -06:00
device-mapper dm crypt: add documentation 2008-04-25 13:27:03 +01:00
DocBook Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-15 03:02:23 -07:00
driver-model trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
dvb V4L/DVB (11138): get_dvb_firmware: add support for downloading the cx2584x firmware for pvrusb2 2009-03-30 12:43:31 -03:00
early-userspace Documentation: Remove last references to BitKeeper. 2008-04-21 22:19:05 +00:00
fault-injection fault-injection: fix example scripts in documentation 2007-07-16 09:05:45 -07:00
fb trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
filesystems oom: move oom_adj value from task_struct to mm_struct 2009-06-16 19:47:43 -07:00
firmware_class firmware_sample_driver.c: fix coding style 2008-04-21 22:23:30 +00:00
frv move frv docs one level up 2008-02-03 15:54:28 +02:00
hwmon hwmon: Update documentation on fan_max 2009-06-01 13:46:50 +02:00
i2c i2c-ocores: Can add I2C devices to the bus 2009-06-13 10:39:28 +01:00
i2o documentation: convert the Documentation directory to UTF-8 2007-05-09 08:58:19 +02:00
ia64 trivial: Fix misspelling of firmware 2009-03-30 15:21:59 +02:00
ide ide: preserve Host Protected Area by default (v2) 2009-06-07 13:52:52 +02:00
infiniband IPoIB: Document newish features 2009-04-08 13:52:01 -07:00
input Input: multitouch - augment event semantics documentation 2009-05-23 09:53:26 -07:00
ioctl V4L/DVB (10870a): remove all references for video_decoder.h 2009-03-30 12:43:15 -03:00
isdn isdn: extend INTERFACE.CAPI document 2009-06-08 00:45:52 -07:00
ja_JP Sync patch for jp_JP/stable_kernel_rules.txt 2009-01-28 15:55:48 -08:00
kbuild kconfig: resort the documentation of the environment variables 2009-06-09 22:37:47 +02:00
kdump trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
ko_KR HOWTO: update misspelling and word incorrected 2007-12-17 10:33:19 -08:00
laptops trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
lguest lguest: add support for indirect ring entries 2009-06-12 22:27:13 +09:30
m68k [SCSI] 53c7xx: fix removal fallout 2008-01-11 18:22:30 -06:00
make Documentation/make/headers_install.txt 2007-10-17 08:43:05 -07:00
mips ide: remove unused CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ 2009-01-14 19:19:03 +01:00
misc-devices drivers/misc/isl29003.c: driver for the ISL29003 ambient light sensor 2009-04-01 08:59:18 -07:00
mn10300 trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
mtd trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
namespaces The namespaces compatibility list doc 2007-11-29 09:24:53 -08:00
netlabel Fix occurrences of "the the " 2007-05-09 08:57:56 +02:00
networking Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-15 03:02:23 -07:00
parisc
PCI PCI MSI: Add example request loop to MSI-HOWTO.txt 2009-03-20 11:35:04 -07:00
pcmcia .gitignore updates 2008-10-30 11:38:45 -07:00
power Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-06-14 13:46:25 -07:00
powerpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-06-15 09:40:05 -07:00
prctl generic, x86: add tests for prctl PR_GET_TSC and PR_SET_TSC 2008-04-19 19:19:55 +02:00
RCU trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
s390 trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
scheduler trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
scsi trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
serial Create/use more directory structure in the Documentation/ tree. 2008-11-14 17:28:53 +00:00
sh sh: Kill off remaining CONFIG_SH_KGDB bits. 2008-12-22 18:44:05 +09:00
sound Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-06-14 13:46:25 -07:00
sparc sparc: Remove Documentation/sparc/sbus_drivers.txt 2008-08-29 02:15:25 -07:00
spi spi: documentation: emphasise spi_master.setup() semantics 2009-04-21 13:41:50 -07:00
sysctl vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim 2009-06-16 19:47:45 -07:00
telephony remove mention of CONFIG_KMOD from documentation 2008-07-22 19:24:29 +10:00
thermal thermal: update the documentation 2008-04-29 02:49:47 -04:00
timers trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
trace trivial: Remove the hyphen from git commands 2009-06-12 18:01:51 +02:00
uml
usb trivial: usb: fix missing space typo in doc 2009-06-12 18:01:51 +02:00
video4linux trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
vm pagemap: add page-types tool 2009-06-16 19:47:38 -07:00
w1 w1: send status messages after command processing 2009-01-08 08:31:14 -08:00
watchdog .gitignore updates 2008-10-30 11:38:45 -07:00
wimax i2400m: documentation and instructions for usage 2009-01-07 10:00:18 -08:00
x86 Merge branch 'linus' into x86/mce3 2009-06-11 23:31:52 +02:00
zh_CN Chinese: add translation of Codingstyle 2008-01-24 20:40:04 -08:00
00-INDEX trivial: fix where cgroup documentation is not correctly referred to 2009-03-30 15:22:02 +02:00
applying-patches.txt
atomic_ops.txt documentation: atomic_add_unless() doesn't imply mb() on failure 2008-02-23 17:52:36 -08:00
bad_memory.txt Document handling of bad memory 2008-12-03 16:09:53 -07:00
basic_profiling.txt
binfmt_misc.txt documentation: convert the Documentation directory to UTF-8 2007-05-09 08:58:19 +02:00
braille-console.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
bt8xxgpio.txt gpio: add bt8xxgpio driver 2008-07-25 10:53:30 -07:00
BUG-HUNTING Documentation: add hint about call traces & module symbols to BUG-HUNTING 2008-02-06 10:41:09 -08:00
c2port.txt Add c2 port support 2008-11-12 17:17:18 -08:00
cachetlb.txt remove unused flush_tlb_pgtables 2007-10-19 11:53:34 -07:00
Changes Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next 2009-06-14 14:12:18 -07:00
CodingStyle trivial: fix typo milisecond/millisecond for documentation and source comments. 2009-06-12 18:01:46 +02:00
cpu-hotplug.txt x86: use possible_cpus=NUM to extend the possible cpus allowed 2008-12-18 12:08:05 +01:00
cpu-load.txt
cputopology.txt cpumask: Use topology_core_cpumask()/topology_thread_cpumask() 2009-01-11 19:12:49 +01:00
credentials.txt CRED: Documentation 2008-11-14 10:39:26 +11:00
dcdbas.txt
debugging-modules.txt Documentation: Clarify when module debugging actually works. 2008-02-03 15:27:38 +02:00
debugging-via-ohci1394.txt firewire: fw-ohci: add option for remote debugging 2008-04-18 17:55:33 +02:00
dell_rbu.txt trivial: Documentation/dell_rbu.txt: fix typos 2009-06-12 18:01:50 +02:00
devices.txt lanana: assign a device name and numbering for MAX3100 2009-04-07 08:44:05 -07:00
DMA-API.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
DMA-attributes.txt powerpc/cell: Add DMA_ATTR_WEAK_ORDERING dma attribute and use in Cell IOMMU code 2008-07-22 10:39:36 +10:00
DMA-ISA-LPC.txt
DMA-mapping.txt dma-mapping: update the old macro DMA_nBIT_MASK related documentations 2009-04-07 08:31:12 -07:00
dmaengine.txt async_tx, dmaengine: document channel allocation and api rework 2009-01-05 18:10:19 -07:00
dontdiff dontdiff: Fix asm exclude 2009-03-26 15:45:43 -07:00
dynamic-debug-howto.txt Dynamic debug: allow simple quoting of words 2009-03-24 16:38:27 -07:00
edac.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
eisa.txt
email-clients.txt Documentation/email-clients.txt: add some info about gmail 2008-11-06 15:41:19 -08:00
exception.txt
feature-removal-schedule.txt Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-15 03:02:23 -07:00
futex-requeue-pi.txt futex: add requeue-pi documentation 2009-05-09 07:12:50 +02:00
gpio.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
highuid.txt [SPARC]: Remove SunOS and Solaris binary support. 2008-04-21 15:10:15 -07:00
HOWTO Remove Andrew Morton's http://www.zip.com.au/~akpm/ 2008-10-16 11:21:32 -07:00
hw_random.txt hw_random doc updates 2008-03-24 19:22:19 -07:00
ics932s401 ics932s401: new clock generator chip driver 2008-11-12 17:17:18 -08:00
initrd.txt use the newc archive format as requested by initramfs 2008-02-03 14:54:41 +02:00
Intel-IOMMU.txt Documentation cleanup: trivial misspelling, punctuation, and grammar corrections. 2008-07-26 12:00:06 -07:00
io_ordering.txt
io-mapping.txt io mapping: improve documentation 2008-11-03 18:21:44 +01:00
IO-mapping.txt Documentation: move DMA-mapping.txt to Doc/PCI/ 2009-01-29 18:19:29 -08:00
iostats.txt Documentation cleanup: trivial misspelling, punctuation, and grammar corrections. 2008-07-26 12:00:06 -07:00
IPMI.txt IPMI: new NMI handling 2007-10-18 14:37:32 -07:00
IRQ-affinity.txt genirq: Expose default irq affinity mask (take 3) 2008-06-05 15:18:30 +02:00
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt Documentation/java.txt: typo and grammar fixes 2007-10-20 02:37:21 +02:00
kernel-doc-nano-HOWTO.txt kernel-doc: restrict syntax for private: and public: 2009-05-02 15:36:10 -07:00
kernel-docs.txt doc: update to URL and status of kernel-docs.txt entry 2008-06-06 11:29:10 -07:00
kernel-parameters.txt Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2009-06-15 09:32:52 -07:00
keys-request-key.txt keys: allow the callout data to be passed as a blob rather than a string 2008-04-29 08:06:16 -07:00
keys.txt Documentation cleanup: trivial misspelling, punctuation, and grammar corrections. 2008-07-26 12:00:06 -07:00
kmemleak.txt kmemleak: Add documentation on the memory leak detector 2009-06-11 17:03:29 +01:00
kobject.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
kprobes.txt kprobes: support kretprobe and jprobe per-probe disabling 2009-04-07 08:31:08 -07:00
kref.txt docs: convert kref semaphore to mutex 2008-02-06 10:41:09 -08:00
ldm.txt LDM: Fix for Windows Vista dynamic disks 2007-05-21 09:58:40 -07:00
leds-class.txt Documentation cleanup: trivial misspelling, punctuation, and grammar corrections. 2008-07-26 12:00:06 -07:00
local_ops.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
lockdep-design.txt locking: Documentation: lockdep-design.txt, fix note of state bits 2009-04-26 18:21:24 +02:00
lockstat.txt lockstat: contend with points 2008-10-20 15:43:10 +02:00
logo.gif Revert "linux.conf.au 2009: Tuz" 2009-04-27 12:00:27 -07:00
logo.txt Revert "linux.conf.au 2009: Tuz" 2009-04-27 12:00:27 -07:00
magic-number.txt documentation: update header file paths 2009-01-06 15:59:28 -08:00
Makefile docsrc: build Documentation/ sources 2008-08-12 16:07:30 -07:00
ManagementStyle docs: fix ManagementStyle book name 2008-10-30 11:38:46 -07:00
markers.txt markers: comment marker_synchronize_unregister() on data dependency 2008-11-28 16:47:41 +01:00
mca.txt The ps2esdi driver was marked as BROKEN more than two years ago due to being 2008-03-17 09:03:05 +01:00
md.txt Documentation/md.txt update 2009-03-31 15:18:37 +11:00
memory-barriers.txt sched: Document memory barriers implied by sleep/wake-up primitives 2009-04-29 14:15:55 +02:00
memory-hotplug.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
memory.txt
mono.txt
mutex-design.txt Documentation: Add nested versions of mutex locks to docs 2007-10-20 00:15:26 +02:00
nmi_watchdog.txt x86, nmi-watchdog: update procfs nmi_watchdog file documentation v2 2008-10-30 19:07:04 +01:00
nommu-mmap.txt NOMMU: Make mmap allocation page trimming behaviour configurable. 2009-01-08 12:04:47 +00:00
numastat.txt
oops-tracing.txt Taint kernel after WARN_ON(condition) 2008-04-29 08:05:59 -07:00
parport-lowlevel.txt plip: fix parport_register_device name parameter 2007-11-26 19:39:01 -08:00
parport.txt
pi-futex.txt
pnp.txt Documentation: Replace obsolete "driverfs" with "sysfs". 2008-01-24 20:40:04 -08:00
preempt-locking.txt
printk-formats.txt DOC: add printk-formats.txt 2008-11-12 17:17:17 -08:00
prio_tree.txt
rbtree.txt trivial: rbtree.txt: fix rb_entry() parameters in sample code 2009-06-12 18:01:47 +02:00
rfkill.txt rfkill: document /dev/rfkill 2009-06-03 14:06:15 -04:00
robust-futex-ABI.txt
robust-futexes.txt
rt-mutex-design.txt
rt-mutex.txt
rtc.txt rtc: cleanup example code 2008-02-06 10:41:14 -08:00
SAK.txt Remove Andrew Morton's old email accounts 2008-10-16 11:21:32 -07:00
SecurityBugs
SELinux.txt selinux: add support for installing a dummy policy (v2) 2008-08-27 08:54:08 +10:00
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
slow-work.txt Document the slow work thread pool 2009-04-03 16:42:35 +01:00
SM501.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
Smack.txt smack: implement logging V3 2009-04-14 09:00:23 +10:00
sparse.txt Documentation: explain the difference between __bitwise and __bitwise__ 2009-04-11 08:18:11 +02:00
spinlocks.txt Add additional examples in Documentation/spinlocks.txt 2008-04-11 13:21:14 -06:00
stable_api_nonsense.txt stable_api_nonsense.txt: Disambiguate the use of "this" by using "that" to refer to the syscall interface 2007-07-30 14:25:12 -07:00
stable_kernel_rules.txt Update stable tree documentation 2008-10-29 15:03:49 -07:00
SubmitChecklist documentation: explain memory barriers 2008-10-16 11:21:32 -07:00
SubmittingDrivers Remove Andrew Morton's old email accounts 2008-10-16 11:21:32 -07:00
SubmittingPatches Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-06-14 13:46:25 -07:00
svga.txt
sysfs-rules.txt Doc/sysfs-rules: Swap the order of the words so the sentence makes more sense 2009-05-08 19:22:20 -07:00
sysrq.txt Merge branch 'tracing/core-v2' into tracing-for-linus 2009-04-02 00:49:02 +02:00
tomoyo.txt tomoyo: add Documentation/tomoyo.txt 2009-04-14 09:14:58 +10:00
unaligned-memory-access.txt introduce HAVE_EFFICIENT_UNALIGNED_ACCESS Kconfig symbol 2008-07-25 10:53:27 -07:00
unicode.txt
unshare.txt
VGA-softcursor.txt
video-output.txt
volatile-considered-harmful.txt Documentation cleanup: trivial misspelling, punctuation, and grammar corrections. 2008-07-26 12:00:06 -07:00
voyager.txt
zorro.txt