At first look, mark_page_accessed() in follow_page() seems a bit strange.
It seems pte_mkyoung() would be better consistent with other kernel code.
However, it is intentional. The commit log said:
------------------------------------------------
commit 9e45f61d69be9024a2e6bef3831fb04d90fac7a8
Author: akpm <akpm>
Date: Fri Aug 15 07:24:59 2003 +0000
[PATCH] Use mark_page_accessed() in follow_page()
Touching a page via follow_page() counts as a reference so we should be
either setting the referenced bit in the pte or running mark_page_accessed().
Altering the pte is tricky because we haven't implemented an atomic
pte_mkyoung(). And mark_page_accessed() is better anyway because it has more
aging state: it can move the page onto the active list.
BKrev: 3f3c8acbplT8FbwBVGtth7QmnqWkIw
------------------------------------------------
The atomic issue is still true nowadays. adding comment help to understand
code intention and it would be better.
[akpm@linux-foundation.org: clarify text]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shrink_inactive_list() scans in sc->swap_cluster_max chunks until it hits
the scan limit it was passed.
shrink_inactive_list()
{
do {
isolate_pages(swap_cluster_max)
shrink_page_list()
} while (nr_scanned < max_scan);
}
This assumes that swap_cluster_max is not bigger than the scan limit
because the latter is checked only after at least one iteration.
In shrink_all_memory() sc->swap_cluster_max is initialized to the overall
reclaim goal in the beginning but not decreased while reclaim is making
progress which leads to subsequent calls to shrink_inactive_list()
reclaiming way too much in the one iteration that is done unconditionally.
Set sc->swap_cluster_max always to the proper goal before doing
shrink_all_zones()
shrink_list()
shrink_inactive_list().
While the current shrink_all_memory() happily reclaims more than actually
requested, this patch fixes it to never exceed the goal:
unpatched
wanted=10000 reclaimed=13356
wanted=10000 reclaimed=19711
wanted=10000 reclaimed=10289
wanted=10000 reclaimed=17306
wanted=10000 reclaimed=10700
wanted=10000 reclaimed=10004
wanted=10000 reclaimed=13301
wanted=10000 reclaimed=10976
wanted=10000 reclaimed=10605
wanted=10000 reclaimed=10088
wanted=10000 reclaimed=15000
patched
wanted=10000 reclaimed=10000
wanted=10000 reclaimed=9599
wanted=10000 reclaimed=8476
wanted=10000 reclaimed=8326
wanted=10000 reclaimed=10000
wanted=10000 reclaimed=10000
wanted=10000 reclaimed=9919
wanted=10000 reclaimed=10000
wanted=10000 reclaimed=10000
wanted=10000 reclaimed=10000
wanted=10000 reclaimed=10000
wanted=10000 reclaimed=9624
wanted=10000 reclaimed=10000
wanted=10000 reclaimed=10000
wanted=8500 reclaimed=8092
wanted=316 reclaimed=316
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: MinChan Kim <minchan.kim@gmail.com>
Acked-by: Nigel Cunningham <ncunningham@crca.org.au>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a79311c14e "vmscan: bail out of
direct reclaim after swap_cluster_max pages" moved the nr_reclaimed
counter into the scan control to accumulate the number of all reclaimed
pages in a reclaim invocation.
shrink_all_memory() can use the same mechanism. it increase code
consistency and redability.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit bf3f3bc5e7 (mm: don't
mark_page_accessed in fault path) only remove the mark_page_accessed() in
filemap_fault().
Therefore, swap-backed pages and file-backed pages have inconsistent
behavior. mark_page_accessed() should be removed from do_swap_page().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup
In almost cases, for_each_zone() is used with populated_zone(). It's
because almost function doesn't need memoryless node information.
Therefore, for_each_populated_zone() can help to make code simplify.
This patch has no functional change.
[akpm@linux-foundation.org: small cleanup]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sc.may_swap does not only influence reclaiming of anon pages but pages
mapped into pagetables in general, which also includes mapped file pages.
In shrink_page_list():
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
For anon pages, this makes sense as they are always mapped and reclaiming
them always requires swapping.
But mapped file pages are skipped here as well and it has nothing to do
with swapping.
The real effect of the knob is whether mapped pages are unmapped and
reclaimed or not. Rename it to `may_unmap' to have its name match its
actual meaning more precisely.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: MinChan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to call for int_sqrt if argument is 0.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmap's dirty_list is unused. It's for optimizing flushing. but Nick
didn't write the code yet. so, we don't need it until time as it is
needed.
This patch removes vmap_block's dirty_list and codes related to it.
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case if start_pfn overlap the upper bound no need to test end_pfn again
since we have it already trimmed.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: build fix
fix typo in mm/slob.c:
mm/slob.c:469: error: ‘flags’ undeclared (first use in this function)
mm/slob.c:469: error: (Each undeclared identifier is reported only once
mm/slob.c:469: error: for each function it appears in.)
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090128135457.350751756@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-stage-3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (190 commits)
Revert "cpuacct: reduce one NULL check in fast-path"
Revert "x86: don't compile vsmp_64 for 32bit"
x86: Correct behaviour of irq affinity
x86: early_ioremap_init(), use __fix_to_virt(), because we are sure it's safe
x86: use default_cpu_mask_to_apicid for 64bit
x86: fix set_extra_move_desc calling
x86, PAT, PCI: Change vma prot in pci_mmap to reflect inherited prot
x86/dmi: fix dmi_alloc() section mismatches
x86: e820 fix various signedness issues in setup.c and e820.c
x86: apic/io_apic.c define msi_ir_chip and ir_ioapic_chip all the time
x86: irq.c keep CONFIG_X86_LOCAL_APIC interrupts together
x86: irq.c use same path for show_interrupts
x86: cpu/cpu.h cleanup
x86: Fix a couple of sparse warnings in arch/x86/kernel/apic/io_apic.c
Revert "x86: create a non-zero sized bm_pte only when needed"
x86: pci-nommu.c cleanup
x86: io_delay.c cleanup
x86: rtc.c cleanup
x86: i8253 cleanup
x86: kdebugfs.c cleanup
...
It doesn't change the semantics, but it looks like
the logical 'or' was meant to be used here.
Signed-off-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Impact: cleanup
Time to clean up remaining laggards using the old cpu_ functions.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Trond.Myklebust@netapp.com
Impact: cleanup
(Thanks to Al Viro for reminding me of this, via Ingo)
CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so:
#define CPU_MASK_ALL (cpumask_t) { { ... } }
Taking the address of such a temporary is questionable at best,
unfortunately 321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added
CPU_MASK_ALL_PTR:
#define CPU_MASK_ALL_PTR (&CPU_MASK_ALL)
Which formalizes this practice. One day gcc could bite us over this
usage (though we seem to have gotten away with it so far).
So replace everywhere which used &CPU_MASK_ALL or CPU_MASK_ALL_PTR
with the modern "cpu_all_mask" (a real const struct cpumask *).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Travis <travis@sgi.com>
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (422 commits)
[ARM] 5435/1: fix compile warning in sanity_check_meminfo()
[ARM] 5434/1: ARM: OMAP: Fix mailbox compile for 24xx
[ARM] pxa: fix the bad assumption that PCMCIA sockets always start with 0
[ARM] pxa: fix Colibri PXA300 and PXA320 LCD backlight pins
imxfb: Fix TFT mode
i.MX21/27: remove ifdef CONFIG_FB_IMX
imxfb: add clock support
mxc: add arch_reset() function
clkdev: add possibility to get a clock based on the device name
i.MX1: remove fb support from mach-imx
[ARM] pxa: build arch/arm/plat-pxa/mfp.c only when PXA3xx or ARCH_MMP defined
Gemini: Add support for Teltonika RUT100
Gemini: gpiolib based GPIO support v2
MAINTAINERS: add myself as Gemini architecture maintainer
ARM: Add Gemini architecture v3
[ARM] OMAP: Fix compile for omap2_init_common_hw()
MAINTAINERS: Add myself as Faraday ARM core variant maintainer
ARM: Add support for FA526 v2
[ARM] acorn,ebsa110,footbridge,integrator,sa1100: Convert asm/io.h to linux/io.h
[ARM] collie: fix two minor formatting nits
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slob: fix lockup in slob_free()
slub: use get_track()
slub: rename calculate_min_partial() to set_min_partial()
slub: add min_partial sysfs tunable
slub: move min_partial to struct kmem_cache
SLUB: Fix default slab order for big object sizes
SLUB: Do not pass 8k objects through to the page allocator
SLUB: Introduce and use SLUB_MAX_SIZE and SLUB_PAGE_SHIFT constants
slob: clean up the code
SLUB: Use ->objsize from struct kmem_cache_cpu in slab_free()
* 'for-2.6.30' of git://git.kernel.dk/linux-2.6-block:
Get rid of pdflush_operation() in emergency sync and remount
btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty
Move the default_backing_dev_info out of readahead.c and into backing-dev.c
block: Repeated lines in switching-sched.txt
bsg: Remove bogus check against request_queue->max_sectors
block: WARN in __blk_put_request() for potential bio leak
loop: fix circular locking in loop_clr_fd()
loop: support barrier writes
bsg: add support for tail queuing
cpqarray: enable bus mastering
block: genhd.h cleanup patch
block: add private bio_set for bio integrity allocations
block: genhd.h comment needs updating
block: get rid of unused blkdev_free_rq() define
block: remove various blk_queue_*() setting functions in blk_init_queue_node()
cciss: add BUILD_BUG_ON() for catching bad CommandList_struct alignment
block: don't create bio_vec slabs of less than the inline number
block: cleanup bio_alloc_bioset()
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits)
SELinux: inode_doinit_with_dentry drop no dentry printk
SELinux: new permission between tty audit and audit socket
SELinux: open perm for sock files
smack: fixes for unlabeled host support
keys: make procfiles per-user-namespace
keys: skip keys from another user namespace
keys: consider user namespace in key_permission
keys: distinguish per-uid keys in different namespaces
integrity: ima iint radix_tree_lookup locking fix
TOMOYO: Do not call tomoyo_realpath_init unless registered.
integrity: ima scatterlist bug fix
smack: fix lots of kernel-doc notation
TOMOYO: Don't create securityfs entries unless registered.
TOMOYO: Fix exception policy read failure.
SELinux: convert the avc cache hash list to an hlist
SELinux: code readability with avc_cache
SELinux: remove unused av.decided field
SELinux: more careful use of avd in avc_has_perm_noaudit
SELinux: remove the unused ae.used
SELinux: check seqno when updating an avc_node
...
Enlarge default dirty ratios from 5/10 to 10/20. This fixes [Bug
#12809] iozone regression with 2.6.29-rc6.
The iozone benchmarks are performed on a 1200M file, with 8GB ram.
iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
The performance regression is triggered by commit 1cf6e7d83bf3(mm: task
dirty accounting fix), which makes more correct/thorough dirty
accounting.
The default 5/10 dirty ratios were picked (a) with the old dirty logic
and (b) largely at random and (c) designed to be aggressive. In
particular, that (a) means that having fixed some of the dirty
accounting, maybe the real bug is now that it was always too aggressive,
just hidden by an accounting issue.
The enlarged 10/20 dirty ratios are just about enough to fix the regression.
[ We will have to look at how this affects the old fsync() latency issue,
but that probably will need independent work. - Linus ]
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: "Lin, Ming M" <ming.m.lin@intel.com>
Tested-by: "Lin, Ming M" <ming.m.lin@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't hold SLOB lock when freeing the page. Reduces lock hold width. See
the following thread for discussion of the bug:
http://marc.info/?l=linux-kernel&m=123709983214143&w=2
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Use get_track() in set_track()
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Impact: build fix on SH !CONFIG_MMU
Stephen Rothwell reported this linux-next build failure on the SH
architecture:
kernel/built-in.o: In function `disable_all_kprobes':
kernel/kprobes.c:1382: undefined reference to `text_mutex'
[...]
And observed:
| Introduced by commit 4460fdad85 ("tracing,
| Text Edit Lock - kprobes architecture independent support") from the
| tracing tree. text_mutex is defined in mm/memory.c which is only built
| if CONFIG_MMU is defined, which is not true for sh allmodconfig.
Move this lock to kernel/extable.c (which is already home to various
kernel text related routines), which file is always built-in.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
LKML-Reference: <20090320110602.86351a91.sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Most ARM machines have a non IO coherent cache, meaning that the
dma_map_*() set of functions must clean and/or invalidate the affected
memory manually before DMA occurs. And because the majority of those
machines have a VIVT cache, the cache maintenance operations must be
performed using virtual
addresses.
When a highmem page is kunmap'd, its mapping (and cache) remains in place
in case it is kmap'd again. However if dma_map_page() is then called with
such a page, some cache maintenance on the remaining mapping must be
performed. In that case, page_address(page) is non null and we can use
that to synchronize the cache.
It is unlikely but still possible for kmap() to race and recycle the
virtual address obtained above, and use it for another page before some
on-going cache invalidation loop in dma_map_page() is done. In that case,
the new mapping could end up with dirty cache lines for another page,
and the unsuspecting cache invalidation loop in dma_map_page() might
simply discard those dirty cache lines resulting in data loss.
For example, let's consider this sequence of events:
- dma_map_page(..., DMA_FROM_DEVICE) is called on a highmem page.
--> - vaddr = page_address(page) is non null. In this case
it is likely that the page has valid cache lines
associated with vaddr. Remember that the cache is VIVT.
--> for (i = vaddr; i < vaddr + PAGE_SIZE; i += 32)
invalidate_cache_line(i);
*** preemption occurs in the middle of the loop above ***
- kmap_high() is called for a different page.
--> - last_pkmap_nr wraps to zero and flush_all_zero_pkmaps()
is called. The pkmap_count value for the page passed
to dma_map_page() above happens to be 1, so the page
is unmapped. But prior to that, flush_cache_kmaps()
cleared the cache for it. So far so good.
- A fresh pkmap entry is assigned for this kmap request.
The Murphy law says this pkmap entry will eventually
happen to use the same vaddr as the one which used to
belong to the other page being processed by
dma_map_page() in the preempted thread above.
- The kmap_high() caller start dirtying the cache using the
just assigned virtual mapping for its page.
*** the first thread is rescheduled ***
- The for(...) loop is resumed, but now cached
data belonging to a different physical page is
being discarded !
And this is not only a preemption issue as ARM can be SMP as well,
making the above scenario just as likely. Hence the need for some kind
of pkmap page pinning which can be used in any context, primarily for
the benefit of dma_map_page() on ARM.
This provides the necessary interface to cope with the above issue if
ARCH_NEEDS_KMAP_HIGH_GET is defined, otherwise the resulting code is
unchanged.
Signed-off-by: Nicolas Pitre <nico@marvell.com>
Reviewed-by: MinChan Kim <minchan.kim@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Impact: cleanup
Add a new vm flag VM_PFN_AT_MMAP to identify a PFNMAP that is
fully mapped with remap_pfn_range. Patch removes the overloading
of VM_INSERTPAGE from the earlier patch.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Nick Piggin <npiggin@suse.de>
LKML-Reference: <20090313233543.GA19909@linux-os.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
node_to_cpumask (and the blecherous node_to_cpumask_ptr which
contained a declaration) are replaced now everyone implements
cpumask_of_node.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: fix false positive PAT warnings - also fix VirtalBox hang
Use of vma->vm_pgoff to identify the pfnmaps that are fully
mapped at mmap time is broken. vm_pgoff is set by generic mmap
code even for cases where drivers are setting up the mappings
at the fault time.
The problem was originally reported here:
http://marc.info/?l=linux-kernel&m=123383810628583&w=2
Change is_linear_pfn_mapping logic to overload VM_INSERTPAGE
flag along with VM_PFNMAP to mean full PFNMAP setup at mmap
time.
Problem also tracked at:
http://bugzilla.kernel.org/show_bug.cgi?id=12800
Reported-by: Thomas Hellstrom <thellstrom@vmware.com>
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha>@intel.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "ebiederm@xmission.com" <ebiederm@xmission.com>
Cc: <stable@kernel.org> # only for 2.6.29.1, not .28
LKML-Reference: <20090313004527.GA7176@linux-os.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Even when page reclaim is under mem_cgroup, # of scan page is determined by
status of global LRU. Fix that.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: remove spurious WARN on legacy SMP percpu allocator
Commit f2a8205c4e incorrectly added too
tight WARN_ON_ONCE() on alignments for UP and legacy SMP percpu
allocator. Commit e317603694 fixed it
for UP but legacy SMP allocator was forgotten. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sachin P. Sant <sachinp@in.ibm.com>
Impact: code reorganization
Separate out embedding first chunk setup helper from x86 embedding
first chunk allocator and put it in mm/percpu.c. This will be used by
the default percpu first chunk allocator and possibly by other archs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup, more flexibility for first chunk init
Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto.
This restriction stemmed from implementation detail and made things a
bit less intuitive. This patch allows @dyn_size to be specified
regardless of @unit_size and swaps the positions of @dyn_size and
@unit_size so that the parameter order makes more sense (static,
reserved and dyn sizes followed by enclosing unit_size).
While at it, add @unit_size >= PCPU_MIN_UNIT_SIZE sanity check.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: generic addr <-> pcpu ptr conversion macros
There's nothing arch specific about x86 __addr_to_pcpu_ptr() and
__pcpu_ptr_to_addr(). With proper __per_cpu_load and __per_cpu_start
defined, they'll do the right thing regardless of actual layout.
Move these macros from arch/x86/include/asm/percpu.h to mm/percpu.c
and allow archs to override it as necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: fix deadlock and allow atomic free
Percpu allocation always uses GFP_KERNEL and whole alloc/free paths
were protected by single mutex. All percpu allocations have been from
GFP_KERNEL-safe context and the original allocator had this assumption
too. However, by protecting both alloc and free paths with the same
mutex, the new allocator creates free -> alloc -> GFP_KERNEL
dependency which the original allocator didn't have. This can lead to
deadlock if free is called from FS or IO paths. Also, in general,
allocators are expected to allow free to be called from atomic
context.
This patch implements finer grained locking to break the deadlock and
allow atomic free. For details, please read the "Synchronization
rules" comment.
While at it, also add CONTEXT: to function comments to describe which
context they expect to be called from and what they do to it.
This problem was reported by Thomas Gleixner and Peter Zijlstra.
http://thread.gmane.org/gmane.linux.kernel/802384
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Peter Zijlstra <peterz@infradead.org>
This is an architecture independant synchronization around kernel text
modifications through use of a global mutex.
A mutex has been chosen so that kprobes, the main user of this, can sleep
during memory allocation between the memory read of the instructions it
must replace and the memory write of the breakpoint.
Other user of this interface: immediate values.
Paravirt and alternatives are always done when SMP is inactive, so there
is no need to use locks.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
LKML-Reference: <49B142D8.7020601@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: code reorganization for later changes
Do fully free chunk reclamation using a work. This change is to
prepare for locking changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: code reorganization for later changes
Separate out chunk area map extension into a separate function -
pcpu_extend_area_map() - and call it directly from pcpu_alloc() such
that pcpu_alloc_area() is guaranteed to have enough area map slots on
invocation.
With this change, pcpu_alloc_area() does only area allocation and the
only failure mode is when the chunk doens't have enough room, so
there's no need to distinguish it from memory allocation failures.
Make it return -1 on such cases instead of hacky -ENOSPC.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: code reorganization for later changes
With static map handling moved to pcpu_split_block(), pcpu_realloc()
only clutters the code and it's also unsuitable for scheduled locking
changes. Implement and use pcpu_mem_alloc/free() instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: add reserved allocation functionality and use it for module
percpu variables
This patch implements reserved allocation from the first chunk. When
setting up the first chunk, arch can ask to set aside certain number
of bytes right after the core static area which is available only
through a separate reserved allocator. This will be used primarily
for module static percpu variables on architectures with limited
relocation range to ensure that the module perpcu symbols are inside
the relocatable range.
If reserved area is requested, the first chunk becomes reserved and
isn't available for regular allocation. If the first chunk also
includes piggy-back dynamic allocation area, a separate chunk mapping
the same region is created to serve dynamic allocation. The first one
is called static first chunk and the second dynamic first chunk.
Although they share the page map, their different area map
initializations guarantee they serve disjoint areas according to their
purposes.
If arch doesn't setup reserved area, reserved allocation is handled
like any other allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: allow sharing page map, no functional difference yet
Make chunk->page access indirect by adding a pointer and renaming the
actual array to page_ar. This will be used by future changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: argument semantic cleanup
In pcpu_setup_first_chunk(), zero @unit_size and @dyn_size meant
auto-sizing. It's okay for @unit_size as 0 doesn't make sense but 0
dynamic reserve size is valid. Alos, if arch @dyn_size is calculated
from other parameters, it might end up passing in 0 @dyn_size and
malfunction when the size is automatically adjusted.
This patch makes both @unit_size and @dyn_size ssize_t and use -1 for
auto sizing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: no functional change
When the first chunk is created, its initial area map is not allocated
because kmalloc isn't online yet. The map is allocated and
initialized on the first allocation request on the chunk. This works
fine but the scattering of initialization logic between the init
function and allocation path is a bit confusing.
This patch makes the first chunk initialize and use minimal statically
allocated map from pcpu_setpu_first_chunk(). The map resizing path
still needs to handle this specially but it's more straight-forward
and gives more latitude to the init path. This will ease future
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cosmetic, preparation for future changes
Make the following renames in pcpur_setup_first_chunk() in preparation
for future changes.
* s/free_size/dyn_size/
* s/static_vm/first_vm/
* s/static_chunk/schunk/
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: standardize IO on cached ops
On modern CPUs it is almost always a bad idea to use non-temporal stores,
as the regression in this commit has shown it:
30d697f: x86: fix performance regression in write() syscall
The kernel simply has no good information about whether using non-temporal
stores is a good idea or not - and trying to add heuristics only increases
complexity and inserts fragility.
The regression on cached write()s took very long to be found - over two
years. So dont take any chances and let the hardware decide how it makes
use of its caches.
The only exception is drivers/gpu/drm/i915/i915_gem.c: there were we are
absolutely sure that another entity (the GPU) will pick up the dirty
data immediately and that the CPU will not touch that data before the
GPU will.
Also, keep the _nocache() primitives to make it easier for people to
experiment with these details. There may be more clear-cut cases where
non-cached copies can be used, outside of filemap.c.
Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix new breakages introduced by previous fix
Commit c132937556 tried to clean up
bootmem arch wrapper but it wasn't quite correct. Before the commit,
the followings were broken.
* Low level interface functions prefixed with __ ignored arch
preference.
* reserve_bootmem(...) can't be mapped into
reserve_bootmem_node(NODE_DATA(0)->bdata, ...) because the node is
not preference here. The region specified MUST fall into the
specified region; otherwise, it will panic.
After the commit,
* If allocation fails for the arch preferred node, it should fallback
to whatever is available. Instead, it simply failed allocation.
There are too many internal details to allow generic wrapping and
still keep things simple for archs. Plus, all that arch wants is a
way to prefer certain node over another.
This patch drops the generic wrapping around alloc_bootmem_core() and
add alloc_bootmem_core() instead. If necessary, arch can define
bootmem_arch_referred_node() macro or function which takes all
allocation information and returns the preferred node. bootmem
generic code will always try the preferred node first and then
fallback to other nodes as usual.
Breakages noted and changes reviewed by Johannes Weiner.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Impact: remove compile warning
Mark local variable map_end in pcpu_populate_chunk() with
uninitialized_var(). The variable is always used in tandem with
map_start and guaranteed to be initialized before use but gcc doesn't
understand that.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
I just got this new warning from kmemcheck:
WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
^
Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0
EIP is at __purge_vmap_area_lazy+0x117/0x140
EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: 00004000 DR7: 00000000
[<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70
[<c1096f6a>] remove_vm_area+0x2a/0x70
[<c1097025>] __vunmap+0x45/0xe0
[<c10970de>] vunmap+0x1e/0x30
[<c1008ba5>] text_poke+0x95/0x150
[<c1008ca9>] alternatives_smp_unlock+0x49/0x60
[<c171ef47>] alternative_instructions+0x11b/0x124
[<c171f991>] check_bugs+0xbd/0xdc
[<c17148c5>] start_kernel+0x2ed/0x360
[<c171409e>] __init_begin+0x9e/0xa9
[<ffffffff>] 0xffffffff
It happened here:
$ addr2line -e vmlinux -i c1096df7
mm/vmalloc.c:540
Code:
list_for_each_entry(va, &valist, purge_list)
__free_vmap_area(va);
It's this instruction:
mov 0x20(%ebx),%edx
Which corresponds to a dereference of va->purge_list.next:
(gdb) p ((struct vmap_area *) 0)->purge_list.next
Cannot access memory at address 0x20
It seems that we should use "safe" list traversal here, as the element
is freed inside the loop. Please verify that this is the right fix.
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new vmap allocator can wrap the address and get confused in the case
of large allocations or VMALLOC_END near the end of address space.
Problem reported by Christoph Hellwig on a 32-bit XFS workload.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reported-by: Christoph Hellwig <hch@lst.de>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each time I exit Firefox, /proc/meminfo's Committed_AS goes down almost
400 kB: OVERCOMMIT_NEVER would be allowing overcommits it should
prohibit.
Commit fc8744adc8 "Stop playing silly
games with the VM_ACCOUNT flag" changed shmem_file_setup() to set the
shmem file's VM_ACCOUNT flag according to VM_NORESERVE not being set in
the vma flags; but did so only _after_ the shmem_acct_size(flags, size)
call which is expected to pre-account a shared anonymous object.
It's all clearer if we switch shmem.c over to use VM_NORESERVE
throughout in place of !VM_ACCOUNT.
But I very nearly sent in a patch which mistakenly removed the
accounting from tmpfs files: shmem_get_inode()'s memset was good for not
setting VM_ACCOUNT, but now it needs to set VM_NORESERVE.
Rather than setting that by default, then perhaps clearing it again in
shmem_file_setup(), let's pass it as a flag to shmem_get_inode(): that
allows us to remove the #ifdef CONFIG_SHMEM from shmem_file_setup().
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup, enable future change
Add a 'total bytes copied' parameter to __copy_from_user_*nocache(),
and update all the callsites.
The parameter is not used yet - architecture code can use it to
more intelligently decide whether the copy should be cached or
non-temporal.
Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As suggested by Christoph Lameter, rename calculate_min_partial() to
set_min_partial() as the function doesn't really do any calculations.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Most global variables in percpu allocator are initialized during boot
and read only from that point on. Add __read_mostly as per Rusty's
suggestion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Impact: more latitude for first percpu chunk allocation
The first percpu chunk serves the kernel static percpu area and may or
may not contain extra room for further dynamic allocation.
Initialization of the first chunk needs to be done before normal
memory allocation service is up, so it has its own init path -
pcpu_setup_static().
It seems archs need more latitude while initializing the first chunk
for example to take advantage of large page mapping. This patch makes
the following changes to allow this.
* Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space
to reserve in the first chunk for further dynamic allocation.
* Rename pcpu_setup_static() to pcpu_setup_first_chunk().
* Make pcpu_setup_first_chunk() much more flexible by fetching page
pointer by callback and adding optional @unit_size, @free_size and
@base_addr arguments which allow archs to selectively part of chunk
initialization to their likings.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: allow unit_size to be arbitrary multiple of PAGE_SIZE
In dynamic percpu allocator, there is no reason the unit size should
be power of two. Remove the restriction.
As non-power-of-two unit size means that empty chunks fall into the
same slot index as lightly occupied chunks which is bad for reclaming.
Reserve an extra slot for empty chunks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: allow larger alignment for early vmalloc area allocation
Some early vmalloc users might want larger alignment, for example, for
custom large page mapping. Add @align to vm_area_register_early().
While at it, drop docbook comment on non-existent @size.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Impact: cleaner and consistent bootmem wrapping
By setting CONFIG_HAVE_ARCH_BOOTMEM_NODE, archs can define
arch-specific wrappers for bootmem allocation. However, this is done
a bit strangely in that only the high level convenience macros can be
changed while lower level, but still exported, interface functions
can't be wrapped. This not only is messy but also leads to strange
situation where alloc_bootmem() does what the arch wants it to do but
the equivalent __alloc_bootmem() call doesn't although they should be
able to be used interchangeably.
This patch updates bootmem such that archs can override / wrap the
backend function - alloc_bootmem_core() instead of the highlevel
interface functions to allow simpler and consistent wrapping. Also,
HAVE_ARCH_BOOTMEM_NODE is renamed to HAVE_ARCH_BOOTMEM.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@saeurebad.de>
Impact: fix short allocation leading to memory corruption
While dropping rvalue wrapping macros around global parameters,
pcpu_chunk_struct_size was set incorrectly resulting in shorter page
pointer array. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occurs when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
As well as the kernel estimates ideal values for n->min_partial and
ensures it's within a sane range, userspace has no other input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Although it allows for better cacheline use, it is unnecessary to save a
copy of the cache's min_partial value in each kmem_cache_node.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
* hibernate:
PM: Fix suspend_console and resume_console to use only one semaphore
PM: Wait for console in resume
PM: Fix pm_notifiers during user mode hibernation
swsusp: clean up shrink_all_zones()
swsusp: dont fiddle with swappiness
PM: fix build for CONFIG_PM unset
PM/hibernate: fix "swap breaks after hibernation failures"
PM/resume: wait for device probing to finish
Consolidate driver_probe_done() loops into one place
Move local variables to innermost possible scopes and use local
variables to cache calculations/reads done more than once.
No change in functionality (intended).
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sc.swappiness is not used in the swsusp memory shrinking path, do not
set it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
http://bugzilla.kernel.org/show_bug.cgi?id=12239
The image writing code dropped a reference to the current swap device.
This doesn't show up if the hibernation succeeds - because it doesn't
affect the image which gets resumed. But it means multiple _failed_
hibernations end up freeing the swap device while it is still use!
swsusp_write() finds the block device for the swap file using swap_type_of().
It then uses blkdev_get() / blkdev_put() to open and close the block device.
Unfortunately, blkdev_get() assumes ownership of the inode of the block_device
passed to it. So blkdev_put() calls iput() on the inode. This is by design
and other callers expect this behaviour. The fix is for swap_type_of() to take
a reference on the inode using bdget().
Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew was concerned about the unit of variables named or have suffix
size. Every usage in percpu allocator is in bytes but make it super
clear by adding comments.
While at it, make pcpu_depopulate_chunk() take int @off and @size like
everyone else.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Impact: proper vcache flush on unmap_kernel_range()
flush_cache_vunmap() should be called before pages are unmapped. Add
a call to it in unmap_kernel_range().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kzfree() is a wrapper for kfree() that additionally zeroes the underlying
memory before releasing it to the slab allocator.
Currently there is code which memset()s the memory region of an object
before releasing it back to the slab allocator to make sure
security-sensitive data are really zeroed out after use.
These callsites can then just use kzfree() which saves some code, makes
users greppable and allows for a stupid destructor that isn't necessarily
aware of the actual object size.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As a preparational patch to bump up page allocator pass-through threshold,
introduce two new constants SLUB_MAX_SIZE and SLUB_PAGE_SHIFT and convert
mm/slub.c to use them.
Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>