linux/Documentation/vm/balance.rst

.. _balance:

================
Memory Balancing
================

Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
well as for non __GFP_IO allocations.

The first reason why a caller may avoid reclaim is that the caller can not
sleep due to holding a spinlock or is in interrupt context. The second may
be that the caller is willing to fail the allocation without incurring the
overhead of page reclaim. This may happen for opportunistic high-order
allocation requests that have order-0 fallback options. In such cases,
the caller may also wish to avoid waking kswapd.

__GFP_IO allocation requests are made to prevent file system deadlocks.

In the absence of non sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (aka zone free memory is 0), instead of making it
a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
OTOH, if there is a lot of free dma pages, it is preferable to satisfy
regular memory requests by allocating one from the dma pool, instead
of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64 th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.

In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly of the size of lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is, while balancing, we do not need to look at sizes
of lower class zones, the bad part is, we might do too frequent balancing
due to ignoring possibly lower usage in the lower class zones. Also,
with a slight change in the allocation routine, it is possible to reduce
the memclass() macro to be a simple equality.

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.

Note that if the size of the regular zone is huge compared to dma zone,
it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone)

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing can not be done,
probably because all allocation requests are coming from intr context
and all process contexts are sleeping. For 2.3, kswapd does not really
need to balance the highmem zone, since intr context does not request
highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

watemark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
are per-zone fields, used to determine when a zone needs to be balanced. When
the number of pages falls below watermark[WMARK_MIN], the hysteric field
low_on_memory gets set. This stays set till the number of free pages becomes
watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
try to free some pages in the zone (providing GFP_WAIT is set in the request).
Orthogonal to this, is the decision to poke kswapd to free some zone pages.
That decision is not hysteresis based, and is done when the number of free
pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.


(Good) Ideas that I have heard:

1. Dynamic experience should influence balancing: number of failed requests
   for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
   dma pages. (lkd@tantalophile.demon.co.uk)
docs/vm: balance: convert to ReST format Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> 2018-03-21 19:22:18 +00:00			`.. _balance:`

			`================`
			`Memory Balancing`
			`================`

Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-16 22:20:36 +00:00			`Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>`

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 2015-11-07 00:28:21 +00:00			`Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as`
			`well as for non __GFP_IO allocations.`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-16 22:20:36 +00:00
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 2015-11-07 00:28:21 +00:00			`The first reason why a caller may avoid reclaim is that the caller can not`
			`sleep due to holding a spinlock or is in interrupt context. The second may`
			`be that the caller is willing to fail the allocation without incurring the`
			`overhead of page reclaim. This may happen for opportunistic high-order`
			`allocation requests that have order-0 fallback options. In such cases,`
			`the caller may also wish to avoid waking kswapd.`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-16 22:20:36 +00:00
			`__GFP_IO allocation requests are made to prevent file system deadlocks.`

			`In the absence of non sleepable allocation requests, it seems detrimental`
			`to be doing balancing. Page reclamation can be kicked off lazily, that`
			`is, only when needed (aka zone free memory is 0), instead of making it`
			`a proactive process.`

			`That being said, the kernel should try to fulfill requests for direct`
			`mapped pages from the direct mapped pool, instead of falling back on`
			`the dma pool, so as to keep the dma pool filled for dma requests (atomic`
			`or not). A similar argument applies to highmem and direct mapped pages.`
			`OTOH, if there is a lot of free dma pages, it is preferable to satisfy`
			`regular memory requests by allocating one from the dma pool, instead`
			`of incurring the overhead of regular zone balancing.`

			`In 2.2, memory balancing/page reclamation would kick off only when the`
			`_total_ number of free pages fell below 1/64 th of total memory. With the`
			`right ratio of dma and regular memory, it is quite possible that balancing`
			`would not be done even when the dma zone was completely empty. 2.2 has`
			`been running production machines of varying memory sizes, and seems to be`
			`doing fine even with the presence of this problem. In 2.3, due to`
			`HIGHMEM, this problem is aggravated.`

			`In 2.3, zone balancing can be done in one of two ways: depending on the`
			`zone size (and possibly of the size of lower class zones), we can decide`
			`at init time how many free pages we should aim for while balancing any`
			`zone. The good part is, while balancing, we do not need to look at sizes`
			`of lower class zones, the bad part is, we might do too frequent balancing`
			`due to ignoring possibly lower usage in the lower class zones. Also,`
			`with a slight change in the allocation routine, it is possible to reduce`
			`the memclass() macro to be a simple equality.`

			`Another possible solution is that we balance only when the free memory`
			`of a zone _and_ all its lower class zones falls below 1/64th of the`
			`total memory in the zone and its lower class zones. This fixes the 2.2`
			`balancing problem, and stays as close to 2.2 behavior as possible. Also,`
			`the balancing algorithm works the same way on the various architectures,`
			`which have different numbers and types of zones. If we wanted to get`
			`fancy, we could assign different weights to free pages in different`
			`zones in the future.`

			`Note that if the size of the regular zone is huge compared to dma zone,`
			`it becomes less significant to consider the free dma pages while`
			`deciding whether to balance the regular zone. The first solution`
			`becomes more attractive then.`

			`The appended patch implements the second solution. It also "fixes" two`
			`problems: first, kswapd is woken up as in 2.2 on low memory conditions`
			`for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,`
			`so as to give a fighting chance for replace_with_highmem() to get a`
			`HIGHMEM page, as well as to ensure that HIGHMEM allocations do not`
			`fall back into regular zone. This also makes sure that HIGHMEM pages`
docs/vm: balance: convert to ReST format Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> 2018-03-21 19:22:18 +00:00			`are not leaked (for example, in situations where a HIGHMEM page is in`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-16 22:20:36 +00:00			`the swapcache but is not being used by anyone)`

			`kswapd also needs to know about the zones it should balance. kswapd is`
docs/vm: balance: convert to ReST format Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> 2018-03-21 19:22:18 +00:00			`primarily needed in a situation where balancing can not be done,`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-16 22:20:36 +00:00			`probably because all allocation requests are coming from intr context`
			`and all process contexts are sleeping. For 2.3, kswapd does not really`
			`need to balance the highmem zone, since intr context does not request`
			`highmem pages. kswapd looks at the zone_wake_kswapd field in the zone`
			`structure to decide whether a zone needs balancing.`

			`Page stealing from process memory and shm is done if stealing the page would`
			`alleviate memory pressure on any zone in the page's node that has fallen below`
			`its watermark.`

page allocator: use allocation flags as an index to the zone watermark ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether pages_min, pages_low or pages_high is used as the zone watermark when allocating the pages. Two branches in the allocator hotpath determine which watermark to use. This patch uses the flags as an array index into a watermark array that is indexed with WMARK_* defines accessed via helpers. All call sites that use zone->pages_* are updated to use the helpers for accessing the values and the array offsets for setting. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 2009-06-16 22:32:12 +00:00			`watemark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These`
			`are per-zone fields, used to determine when a zone needs to be balanced. When`
			`the number of pages falls below watermark[WMARK_MIN], the hysteric field`
			`low_on_memory gets set. This stays set till the number of free pages becomes`
			`watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will`
			`try to free some pages in the zone (providing GFP_WAIT is set in the request).`
			`Orthogonal to this, is the decision to poke kswapd to free some zone pages.`
			`That decision is not hysteresis based, and is done when the number of free`
			`pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-16 22:20:36 +00:00

			`(Good) Ideas that I have heard:`
docs/vm: balance: convert to ReST format Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> 2018-03-21 19:22:18 +00:00
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-16 22:20:36 +00:00			`1. Dynamic experience should influence balancing: number of failed requests`
docs/vm: balance: convert to ReST format Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> 2018-03-21 19:22:18 +00:00			`for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-16 22:20:36 +00:00			`2. Implement a replace_with_highmem()-like replace_with_regular() to preserve`
docs/vm: balance: convert to ReST format Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> 2018-03-21 19:22:18 +00:00			`dma pages. (lkd@tantalophile.demon.co.uk)`