Up to this point the code assumed old refcounting for hugepages (pre-thp).
This updates the code directly to the thp mapcount tail page refcounting.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We only taken "refs" pins on the head page not "*nr" pins.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"page" may have changed to point to the next hugepage after the loop
completed, The references have been taken on the head page, so the
put_page must happen there too.
This is a longstanding issue pre-thp inclusion.
It's totally unclear how these page_cache_add_speculative and
pte_val(pte) != pte_val(*ptep) checks are necessary across all the
powerpc gup_fast code, when x86 doesn't need any of that: there's no way
the page can be freed with irq disabled so we're guaranteed the
atomic_inc will happen on a page with page_count > 0 (so not needing the
speculative check).
The pte check is also meaningless on x86: no need to rollback on x86 if
the pte changed, because the pte can still change a CPU tick after the
check succeeded and it won't be rolled back in that case. The important
thing is we got a reference on a valid page that was mapped there a CPU
tick ago. So not knowing the soft tlb refill code of ppc64 in great
detail I'm not removing the "speculative" page_count increase and the
pte checks across all the code, but unless there's a strong reason for
it they should be later cleaned up too.
If a pte can change from huge to non-huge (like it could happen with
THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat
the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs
only so I'm not altering that.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This part of gup_fast doesn't seem capable of handling hugetlbfs ptes,
those should be handled by gup_hugepd only, so these checks are
superfluous.
Plus if this wasn't a noop, it would have oopsed because, the insistence
of using the speculative refcounting would trigger a VM_BUG_ON if a tail
page was encountered in the page_cache_get_speculative().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().
He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().
So the best way to fix the problem is to keep page_tail->_count zero at
all times. This will guarantee that get_page_unless_zero() can never
succeed on any tail page. page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).
While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic. As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns. It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).
The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages. The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it. A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.
This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'next/soc' of git://git.linaro.org/people/arnd/arm-soc: (21 commits)
MAINTAINERS: add ARM/FREESCALE IMX6 entry
arm/imx: merge i.MX3 and i.MX6
arm/imx6q: add suspend/resume support
arm/imx6q: add device tree machine support
arm/imx6q: add smp and cpu hotplug support
arm/imx6q: add core drivers clock, gpc, mmdc and src
arm/imx: add gic_handle_irq function
arm/imx6q: add core definitions and low-level debug uart
arm/imx6q: add device tree source
ARM: highbank: add suspend support
ARM: highbank: Add cpu hotplug support
ARM: highbank: add SMP support
MAINTAINERS: add Calxeda Highbank ARM platform
ARM: add Highbank core platform support
ARM: highbank: add devicetree source
ARM: l2x0: add empty l2x0_of_init
picoxcell: add a definition of VMALLOC_END
picoxcell: remove custom ioremap implementation
picoxcell: add the DTS for the PC7302 board
picoxcell: add the DTS for pc3x2 and pc3x3 devices
...
Fix up trivial conflicts in arch/arm/Kconfig, and some more header file
conflicts in arch/arm/mach-omap2/board-generic.c (as per an ealier merge
by Arnd).
* 'next/dt' of git://git.linaro.org/people/arnd/arm-soc:
ARM: gic: use module.h instead of export.h
ARM: gic: fix irq_alloc_descs handling for sparse irq
ARM: gic: add OF based initialization
ARM: gic: add irq_domain support
irq: support domains with non-zero hwirq base
of/irq: introduce of_irq_init
ARM: at91: add at91sam9g20 and Calao USB A9G20 DT support
ARM: at91: dt: at91sam9g45 family and board device tree files
arm/mx5: add device tree support for imx51 babbage
arm/mx5: add device tree support for imx53 boards
ARM: msm: Add devicetree support for msm8660-surf
msm_serial: Add devicetree support
msm_serial: Use relative resources for iomem
Fix up conflicts in arch/arm/mach-at91/{at91sam9260.c,at91sam9g45.c}
* 'next/cleanup2' of git://git.linaro.org/people/arnd/arm-soc: (31 commits)
ARM: OMAP: Warn if omap_ioremap is called before SoC detection
ARM: OMAP: Move set_globals initialization to happen in init_early
ARM: OMAP: Map SRAM later on with ioremap_exec()
ARM: OMAP: Remove calls to SRAM allocations for framebuffer
ARM: OMAP: Avoid cpu_is_omapxxxx usage until map_io is done
ARM: OMAP1: Use generic map_io, init_early and init_irq
arm/dts: OMAP3+: Add mpu, dsp and iva nodes
arm/dts: OMAP4: Add a main ocp entry bound to l3-noc driver
ARM: OMAP2+: l3-noc: Add support for device-tree
ARM: OMAP2+: board-generic: Add i2c static init
ARM: OMAP2+: board-generic: Add DT support to generic board
arm/dts: Add support for OMAP3 Beagle board
arm/dts: Add initial device tree support for OMAP3 SoC
arm/dts: Add support for OMAP4 SDP board
arm/dts: Add support for OMAP4 PandaBoard
arm/dts: Add initial device tree support for OMAP4 SoC
ARM: OMAP: omap_device: Add a method to build an omap_device from a DT node
ARM: OMAP: omap_device: Add omap_device_[alloc|delete] for DT integration
of: Add helpers to get one string in multiple strings property
ARM: OMAP2+: devices: Remove all omap_device_pm_latency structures
...
Fix up trivial header file conflicts in arch/arm/mach-omap2/board-generic.c
* 'next/cross-platform' of git://git.linaro.org/people/arnd/arm-soc:
arm/imx: use Kconfig choice for low-level debug UART selection
ARM: realview: use Kconfig choice for debug UART selection
ARM: plat-samsung: use Kconfig choice for debug UART selection
ARM: versatile: convert logical CPU numbers to physical numbers
ARM: ux500: convert logical CPU numbers to physical numbers
ARM: shmobile: convert logical CPU numbers to physical numbers
ARM: msm: convert logical CPU numbers to physical numbers
ARM: exynos4: convert logical CPU numbers to physical numbers
Fix up trivial conflict (config DEBUG_S3C_UART move/split vs addition of
ARM_KPROBES_TEST option) in arch/arm/Kconfig.debug
* 'next/devel' of git://git.linaro.org/people/arnd/arm-soc: (50 commits)
ARM: tegra: update defconfig
arm/tegra: Harmony: Configure PMC for low-level interrupts
arm/tegra: device tree support for ventana board
arm/tegra: add support for ventana pinmuxing
arm/tegra: prepare Seaboard pinmux code for derived boards
arm/tegra: pinmux: ioremap registers
gpio/tegra: Convert to a platform device
arm/tegra: Convert pinmux driver to a platform device
arm/dt: Tegra: Add pinmux node to tegra20.dtsi
arm/tegra: Prep boards for gpio/pinmux conversion to pdevs
ARM: mx5: fix clock usage for suspend
ARM i.MX entry-macro.S: remove now unused code
ARM i.MX boards: use CONFIG_MULTI_IRQ_HANDLER
ARM i.MX tzic: add handle_irq function
ARM i.MX avic: add handle_irq function
ARM: mx25: Add the missing IIM base definition
ARM i.MX avic: convert to use generic irq chip
mx31moboard: Add poweroff support
ARM: mach-qong: Add watchdog support
ARM: davinci: AM18x: Add wl1271/wlan support
...
Fix up conflicts in:
arch/arm/mach-at91/at91sam9g45.c
arch/arm/mach-mx5/devices-imx53.h
arch/arm/plat-mxc/include/mach/memory.h
* 'next/board' of git://git.linaro.org/people/arnd/arm-soc: (34 commits)
ep93xx: add support Vision EP9307 SoM
ARM: mxs: Add initial support for DENX MX28
ARM: EXYNOS4: Add support SMDK4412 Board
ARM: EXYNOS4: Add MCT support for EXYNOS4412
ARM: EXYNOS4: Add functions for gic interrupt handling
ARM: EXYNOS4: Add support clock for EXYNOS4412
ARM: EXYNOS4: Add support new EXYNOS4412 SoC
ARM: EXYNOS4: Add support MCT PPI for EXYNOS4212
ARM: EXYNOS4: Add support PPI in external GIC
ARM: EXYNOS4: convert boot_params to atag_offset
ixp4xx: support omicron ixp425 based boards
ARM: EXYNOS4: Add support SMDK4212 Board
ARM: EXYNOS4: Add support PM for EXYNOS4212
ARM: EXYNOS4: Add support clock for EXYNOS4212
ARM: EXYNOS4: Add support new EXYNOS4212 SoC
at91: USB-A9G20 C01 & C11 board support
at91: merge board USB-A9260 and USB-A9263 together
at91: add support for RSIs EWS board
ARM: SAMSUNG: Fix mask value for S5P64X0 CPU IDs
ARM: SAMSUNG: Fix mask for S3C64xx CPU IDs
...
* 'next/deletion' of git://git.linaro.org/people/arnd/arm-soc:
ARM: mach-nuc93x: delete
Fix up trivial delete/edit conflicts in
arch/arm/mach-nuc93x/{Makefile.boot,mach-nuc932evb.c,time.c}
* 'next/pm' of git://git.linaro.org/people/arnd/arm-soc: (66 commits)
ARM: CSR: PM: use outer_resume to resume L2 cache
ARM: CSR: call l2x0_of_init to init L2 cache of SiRFprimaII
ARM: OMAP: voltage: voltage layer present, even when CONFIG_PM=n
ARM: CSR: PM: add sleep entry for SiRFprimaII
ARM: CSR: PM: save/restore irq status in suspend cycle
ARM: CSR: PM: save/restore timer status in suspend cycle
OMAP4: PM: TWL6030: add cmd register
OMAP4: PM: TWL6030: fix ON/RET/OFF voltages
OMAP4: PM: TWL6030: address 0V conversions
OMAP4: PM: TWL6030: fix uv to voltage for >0x39
OMAP4: PM: TWL6030: fix voltage conversion formula
omap: voltage: add a stub header file for external/regulator use
OMAP2+: VC: more registers are per-channel starting with OMAP5
OMAP3+: voltage: update nominal voltage in voltdm_scale() not VC post-scale
OMAP3+: voltage: rename omap_voltage_get_nom_volt -> voltdm_get_voltage
OMAP3+: voltdm: final removal of omap_vdd_info
OMAP3+: voltage: move/rename curr_volt from vdd_info into struct voltagedomain
OMAP3+: voltage: rename scale and reset functions using voltdm_ prefix
OMAP3+: VP: combine setting init voltage into common function
OMAP3+: VP: remove unused omap_vp_get_curr_volt()
...
Fix up trivial conflict in arch/arm/mach-prima2/l2x0.c (code removal vs
edit)
* 'next/driver' of git://git.linaro.org/people/arnd/arm-soc:
hw_random: add driver for atmel true hardware random number generator
ARM: at91: at91sam9g45: add trng clock and platform device
MX53 Enable the AHCI SATA on MX53 SMD board
MX53 Enable the AHCI SATA on MX53 LOCO board
MX53 Enable the AHCI SATA on MX53 ARD board
AHCI Add the AHCI SATA feature on the MX53 platforms
Fix pata imx resource
ARM: imx: Define functions for registering PATA
ARM: imx: Add PATA clock support
ARM: imx: Add PATA resources for other i.MX processors
imx: efika: Enable pata.
imx51: add pata clock
imx51: add pata device
Fix up trivial conflict (new selects next to each other from separate
branches for EFIKA_COMMON) in arch/arm/mach-mx5/Kconfig
* 'next/fixes' of git://git.linaro.org/people/arnd/arm-soc: (28 commits)
ARM: pxa/cm-x300: properly set bt_reset pin
ARM: mmp: rename SHEEVAD to GPLUGD
ARM: imx: Fix typo 'MACH_MX31_3DS_MXC_NAND_USE_BBT'
ARM: i.MX28: shift frac value in _CLK_SET_RATE
plat-mxc: iomux-v3.h: implicitly enable pull-up/down when that's desired
ARM: mx5: fix clock usage for suspend
ARM: pxa: use correct __iomem annotations
ARM: pxa: sharpsl pm needs SPI
ARM: pxa: centro and treo680 need palm27x
ARM: pxa: make pxafb_smart_*() empty when not enabled
ARM: pxa: select POWER_SUPPLY on raumfeld
ARM: pxa: pxa95x is incompatible with earlier pxa
ARM: pxa: CPU_FREQ_TABLE is needed for CPU_FREQ
ARM: pxa: pxa95x/saarb depends on pxa3xx code
ARM: pxa: allow selecting just one of TREO680/CENTRO
ARM: pxa: export symbols from pxa3xx-ulpi
ARM: pxa: make zylonite_pxa*_init declaration match code
ARM: pxa/z2: fix building error of pxa27x_cpu_suspend() no longer available
ARM: at91: add defconfig for at91sam9g45 family
ARM: at91: remove dependency for Atmel PWM driver selector in Kconfig
...
* 'for-linus/i2c-3.2' of git://git.fluff.org/bjdooks/linux: (47 commits)
i2c-s3c2410: Add device tree support
i2c-s3c2410: Keep a copy of platform data and use it
i2c-nomadik: cosmetic coding style corrections
i2c-au1550: dev_pm_ops conversion
i2c-au1550: increase timeout waiting for master done
i2c-au1550: remove unused ack_timeout
i2c-au1550: remove usage of volatile keyword
i2c-tegra: __iomem annotation fix
i2c-eg20t: Add initialize processing in case i2c-error occurs
i2c-eg20t: Fix flag setting issue
i2c-eg20t: add stop sequence in case wait-event timeout occurs
i2c-eg20t: Separate error processing
i2c-eg20t: Fix 10bit access issue
i2c-eg20t: Modify returned value s32 to long
i2c-eg20t: Fix bus-idle waiting issue
i2c-designware: Fix PCI core warning on suspend/resume
i2c-designware: Add runtime power management support
i2c-designware: Add support for Designware core behind PCI devices.
i2c-designware: Push all register reads/writes into the core code.
i2c-designware: Support multiple cores using same ISR
...
* git://github.com/herbertx/crypto: (48 commits)
crypto: user - Depend on NET instead of selecting it
crypto: user - Add dependency on NET
crypto: talitos - handle descriptor not found in error path
crypto: user - Initialise match in crypto_alg_match
crypto: testmgr - add twofish tests
crypto: testmgr - add blowfish test-vectors
crypto: Make hifn_795x build depend on !ARCH_DMA_ADDR_T_64BIT
crypto: twofish-x86_64-3way - fix ctr blocksize to 1
crypto: blowfish-x86_64 - fix ctr blocksize to 1
crypto: whirlpool - count rounds from 0
crypto: Add userspace report for compress type algorithms
crypto: Add userspace report for cipher type algorithms
crypto: Add userspace report for rng type algorithms
crypto: Add userspace report for pcompress type algorithms
crypto: Add userspace report for nivaead type algorithms
crypto: Add userspace report for aead type algorithms
crypto: Add userspace report for givcipher type algorithms
crypto: Add userspace report for ablkcipher type algorithms
crypto: Add userspace report for blkcipher type algorithms
crypto: Add userspace report for ahash type algorithms
...
The board changes in the imx/devel branch conflict with other changes in
the device imx/dt branch.
Conflicts:
arch/arm/mach-mx5/board-mx53_loco.c
arch/arm/mach-mx5/board-mx53_smd.c
arch/arm/plat-mxc/include/mach/common.h
arch/arm/plat-mxc/include/mach/memory.h
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This is the fifth version of the patchset (with one tiny whitespace fix)
to the Linux kernel to support the Qualcomm Hexagon architecture.
Between now and the next pull requests, Richard Kuo should have his key
signed, etc., and should be back on kernel.org. In the meantime, this
got merged as a emailed patch-series.
* Hexagon: (36 commits)
Add extra arch overrides to asm-generic/checksum.h
Hexagon: Add self to MAINTAINERS
Hexagon: Add basic stacktrace functionality for Hexagon architecture.
Hexagon: Add configuration and makefiles for the Hexagon architecture.
Hexagon: Comet platform support
Hexagon: kgdb support files
Hexagon: Add page-fault support.
Hexagon: Add page table header files & etc.
Hexagon: Add ioremap support
Hexagon: Provide DMA implementation
Hexagon: Implement basic TLB management routines for Hexagon.
Hexagon: Implement basic cache-flush support
Hexagon: Provide basic implementation and/or stubs for I/O routines.
Hexagon: Add user access functions
Hexagon: Add locking types and functions
Hexagon: Add SMP support
Hexagon: Provide basic debugging and system trap support.
Hexagon: Add ptrace support
Hexagon: Add time and timer functions
Hexagon: Add interrupts
...
Mostly all stubs, as the TLB is managed by the hypervisor.
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Modules should be compiled as ordinary .o's; shared objects are not
supported.
Signed-off-by: Linas Vepstas <linas@codeaurora.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>