Commit Graph

1936 Commits

Author SHA1 Message Date
Linus Torvalds
b69f9e17a5 powerpc fixes for 4.20 #2

Merge tag 'powerpc-4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Some things that I missed due to travel, or that came in late.

  Two fixes also going to stable:

   - A revert of a buggy change to the 8xx TLB miss handlers.

   - Our flushing of SPE (Signal Processing Engine) registers on fork
     was broken.

  Other changes:

   - A change to the KVM decrementer emulation to use proper APIs.

   - Some cleanups to the way we do code patching in the 8xx code.

   - Expose the maximum possible memory for the system in
     /proc/powerpc/lparcfg.

   - Merge some updates from Scott: "a couple device tree updates, and a
     fix for a missing prototype warning"

  A few other minor fixes and a handful of fixes for our selftests.

  Thanks to: Aravinda Prasad, Breno Leitao, Camelia Groza, Christophe
  Leroy, Felipe Rechia, Joel Stanley, Naveen N. Rao, Paul Mackerras,
  Scott Wood, Tyrel Datwyler"

* tag 'powerpc-4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (21 commits)
  selftests/powerpc: Fix compilation issue due to asm label
  selftests/powerpc/cache_shape: Fix out-of-tree build
  selftests/powerpc/switch_endian: Fix out-of-tree build
  selftests/powerpc/pmu: Link ebb tests with -no-pie
  selftests/powerpc/signal: Fix out-of-tree build
  selftests/powerpc/ptrace: Fix out-of-tree build
  powerpc/xmon: Relax frame size for clang
  selftests: powerpc: Fix warning for security subdir
  selftests/powerpc: Relax L1d miss targets for rfi_flush test
  powerpc/process: Fix flush_all_to_thread for SPE
  powerpc/pseries: add missing cpumask.h include file
  selftests/powerpc: Fix ptrace tm failure
  KVM: PPC: Use exported tb_to_ns() function in decrementer emulation
  powerpc/pseries: Export maximum memory value
  powerpc/8xx: Use patch_site for perf counters setup
  powerpc/8xx: Use patch_site for memory setup patching
  powerpc/code-patching: Add a helper to get the address of a patch_site
  Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"
  powerpc/8xx: add missing header in 8xx_mmu.c
  powerpc/8xx: Add DT node for using the SEC engine of the MPC885
  ...
2018-11-02 09:19:35 -07:00
Mike Rapoport
7e1c4e2792 memblock: stop using implicit alignment to SMP_CACHE_BYTES
When memblock allocation APIs are called with align = 0, the alignment
is implicitly set to SMP_CACHE_BYTES.

Implicit alignment is done deep in the memblock allocator and it can
come as a surprise.  Not that such an alignment would be wrong even
when used incorrectly, but it is better to be explicit for the sake of
clarity and the principle of least surprise.

Replace all such uses of memblock APIs with the 'align' parameter
explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
in the memblock internal allocation functions.

For the case when memblock APIs are used via helper functions, e.g.
iommu_arena_new_node() on Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where
appropriate.
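
As a hedged illustration of that helper case (the helper name here is
hypothetical, loosely modelled on iommu_arena_new_node()): the alignment
flows through the helper, so the explicit SMP_CACHE_BYTES ends up at the
call site.

  /* Sketch only: a helper that forwards its alignment to memblock. */
  static void * __init arena_new(unsigned long size, unsigned long align)
  {
          return memblock_alloc(size, align);
  }

  /* Callers now state the alignment explicitly: */
  arena = arena_new(sizeof(*arena), SMP_CACHE_BYTES);   /* was: align = 0 */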

The direct memblock APIs users were updated using the semantic patch below:

@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)

[mhocko@suse.com: changelog update]
[akpm@linux-foundation.org: coding-style fixes]
[rppt@linux.ibm.com: fix missed uses of implicit alignment]
  Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paul Burton <paul.burton@mips.com>	[MIPS]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
57c8a661d9 mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below, and then by
semi-automated removal of duplicated '#include <linux/memblock.h>' lines.

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
  Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
c6ffc5ca8f memblock: rename free_all_bootmem to memblock_free_all
The conversion is done using

sed -i 's@free_all_bootmem@memblock_free_all@' \
    $(git grep -l free_all_bootmem)

Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
eb31d559f1 memblock: remove _virt from APIs returning virtual address
The conversion is done using

sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
	$(git grep -l memblock_virt_alloc)

Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Mike Rapoport
9a8dd708d5 memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*
Make it explicit that the caller gets a physical address rather than a
virtual one.

This will also allow using the memblock_alloc prefix for memblock
allocations returning a virtual address, which is done in the following
patches.

The conversion is done using the following semantic patch:

@@
expression e1, e2, e3;
@@
(
- memblock_alloc(e1, e2)
+ memblock_phys_alloc(e1, e2)
|
- memblock_alloc_nid(e1, e2, e3)
+ memblock_phys_alloc_nid(e1, e2, e3)
|
- memblock_alloc_try_nid(e1, e2, e3)
+ memblock_phys_alloc_try_nid(e1, e2, e3)
)

Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Michael Ellerman
3b9672fff7 Merge branch 'next' of https://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next
Updates from Scott:
  "This contains a couple device tree updates, and a fix for a missing
   prototype warning."
2018-10-29 22:32:52 +11:00
Linus Torvalds
685f7e4f16 powerpc updates for 4.20

Merge tag 'powerpc-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - A large series to rewrite our SLB miss handling, replacing a lot of
     fairly complicated asm with much fewer lines of C.

   - Following on from that, we now maintain a cache of SLB entries for
     each process and preload them on context switch, leading to a 27%
     speedup for our context switch benchmark on Power9.

   - Improvements to our handling of SLB multi-hit errors. We now print
     more debug information when they occur, and try to continue running
     by flushing the SLB and reloading, rather than treating them as
     fatal.

   - Enable THP migration on 64-bit Book3S machines (eg. Power7/8/9).

   - Add support for physical memory up to 2PB in the linear mapping on
     64-bit Book3S. We only support up to 512TB as regular system
     memory; otherwise the percpu allocator runs out of vmalloc space.

   - Add stack protector support for 32 and 64-bit, with a per-task
     canary.

   - Add support for PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP.

   - Support recognising "big cores" on Power9, where two SMT4 cores are
     presented to us as a single SMT8 core.

   - A large series to cleanup some of our ioremap handling and PTE
     flags.

   - Add a driver for the PAPR SCM (storage class memory) interface,
     allowing guests to operate on SCM devices (acked by Dan).

   - Changes to our ftrace code to handle very large kernels, where we
     need to use a trampoline to get to ftrace_caller().

  And many other smaller enhancements and cleanups.

  Thanks to: Alan Modra, Alistair Popple, Aneesh Kumar K.V, Anton
  Blanchard, Aravinda Prasad, Bartlomiej Zolnierkiewicz, Benjamin
  Herrenschmidt, Breno Leitao, Cédric Le Goater, Christophe Leroy,
  Christophe Lombard, Dan Carpenter, Daniel Axtens, Finn Thain, Gautham
  R. Shenoy, Gustavo Romero, Haren Myneni, Hari Bathini, Jia Hongtao,
  Joel Stanley, John Allen, Laurent Dufour, Madhavan Srinivasan, Mahesh
  Salgaonkar, Mark Hairgrove, Masahiro Yamada, Michael Bringmann,
  Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Nathan
  Fontenot, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Oliver
  O'Halloran, Paul Mackerras, Petr Vorel, Rashmica Gupta, Reza Arbab,
  Rob Herring, Sam Bobroff, Samuel Mendoza-Jonas, Scott Wood, Stan
  Johnson, Stephen Rothwell, Stewart Smith, Suraj Jitindar Singh, Tyrel
  Datwyler, Vaibhav Jain, Vasant Hegde, YueHaibing, zhong jiang"

* tag 'powerpc-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (221 commits)
  Revert "selftests/powerpc: Fix out-of-tree build errors"
  powerpc/msi: Fix compile error on mpc83xx
  powerpc: Fix stack protector crashes on CPU hotplug
  powerpc/traps: restore recoverability of machine_check interrupts
  powerpc/64/module: REL32 relocation range check
  powerpc/64s/radix: Fix radix__flush_tlb_collapsed_pmd double flushing pmd
  selftests/powerpc: Add a test of wild bctr
  powerpc/mm: Fix page table dump to work on Radix
  powerpc/mm/radix: Display if mappings are exec or not
  powerpc/mm/radix: Simplify split mapping logic
  powerpc/mm/radix: Remove the retry in the split mapping logic
  powerpc/mm/radix: Fix small page at boundary when splitting
  powerpc/mm/radix: Fix overuse of small pages in splitting logic
  powerpc/mm/radix: Fix off-by-one in split mapping logic
  powerpc/ftrace: Handle large kernel configs
  powerpc/mm: Fix WARN_ON with THP NUMA migration
  selftests/powerpc: Fix out-of-tree build errors
  powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected
  powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64
  powerpc/time: isolate scaled cputime accounting in dedicated functions.
  ...
2018-10-26 14:36:21 -07:00
Christophe Leroy
1a210878bf powerpc/8xx: Use patch_site for memory setup patching
The 8xx TLB miss routines are patched at startup in several places.

This patch uses the new patch_site functionality in order to improve
code readability and avoid a label mess when dumping the code with
'objdump -d'.
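
As a hedged sketch of the mechanism (mirroring the patch_site_addr()
helper added by the code-patching patch in this series; the site name
below is hypothetical): a patch_site records a self-relative offset to
the instruction, so the patching code resolves the site instead of
referencing a global asm label.

  /* A patch_site stores a self-relative s32 offset to the instruction. */
  static inline unsigned long patch_site_addr(s32 *site)
  {
          return (unsigned long)site + *site;
  }

  /* Patch the recorded instruction at boot: */
  patch_instruction((unsigned int *)patch_site_addr(&patch__tlb_site), insn);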

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-26 21:58:58 +11:00
Christophe Leroy
cc4ebf5c0a Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"
This reverts commit 4f94b2c746.

That commit was buggy, as it used rlwinm instead of rlwimi (rlwimi
inserts the rotated field while preserving the remaining bits of the
destination; rlwinm replaces the destination entirely, clobbering them).
Instead of fixing that bug, we revert the previous commit in order to
reduce the dependency between L1 entries and L2 entries.

Fixes: 4f94b2c746 ("powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-26 21:58:58 +11:00
Linus Torvalds
0d1e8b8d2b KVM updates for v4.20

Merge tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "ARM:
   - Improved guest IPA space support (32 to 52 bits)

   - RAS event delivery for 32bit

   - PMU fixes

   - Guest entry hardening

   - Various cleanups

   - Port of dirty_log_test selftest

  PPC:
   - Nested HV KVM support for radix guests on POWER9. The performance
     is much better than with PR KVM. Migration and arbitrary level of
     nesting is supported.

   - Disable nested HV-KVM on early POWER9 chips that need a particular
     hardware bug workaround

   - One VM per core mode to prevent potential data leaks

   - PCI pass-through optimization

   - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base

  s390:
   - Initial version of AP crypto virtualization via vfio-mdev

   - Improvement for vfio-ap

   - Set the host program identifier

   - Optimize page table locking

  x86:
   - Enable nested virtualization by default

   - Implement Hyper-V IPI hypercalls

   - Improve #PF and #DB handling

   - Allow guests to use Enlightened VMCS

   - Add migration selftests for VMCS and Enlightened VMCS

   - Allow coalesced PIO accesses

   - Add an option to perform nested VMCS host state consistency check
     through hardware

   - Automatic tuning of lapic_timer_advance_ns

   - Many fixes, minor improvements, and cleanups"

* tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
  Revert "kvm: x86: optimize dr6 restore"
  KVM: PPC: Optimize clearing TCEs for sparse tables
  x86/kvm/nVMX: tweak shadow fields
  selftests/kvm: add missing executables to .gitignore
  KVM: arm64: Safety check PSTATE when entering guest and handle IL
  KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
  arm/arm64: KVM: Enable 32 bits kvm vcpu events support
  arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
  KVM: arm64: Fix caching of host MDCR_EL2 value
  KVM: VMX: enable nested virtualization by default
  KVM/x86: Use 32bit xor to clear registers in svm.c
  kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
  kvm: vmx: Defer setting of DR6 until #DB delivery
  kvm: x86: Defer setting of CR2 until #PF delivery
  kvm: x86: Add payload operands to kvm_multiple_exception
  kvm: x86: Add exception payload fields to kvm_vcpu_events
  kvm: x86: Add has_payload and payload to kvm_queued_exception
  KVM: Documentation: Fix omission in struct kvm_vcpu_events
  KVM: selftests: add Enlightened VMCS test
  ...
2018-10-25 17:57:35 -07:00
Linus Torvalds
ba9f6f8954 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo updates from Eric Biederman:
 "I have been slowly sorting out siginfo and this is the culmination of
  that work.

  The primary result is that, in several ways, the signal infrastructure
  has been made less error-prone. The code has been updated so that manually
  specifying SEND_SIG_FORCED is never necessary. The conversion to the
  new siginfo sending functions is now complete, which makes it
  difficult to send a signal without filling in the proper siginfo
  fields.

  At the tail end of the patchset comes the optimization of decreasing
  the size of struct siginfo in the kernel from 128 bytes to about 48
  bytes on 64bit. The fundamental observation that enables this is that,
  by definition, none of the known ways to use struct siginfo uses the
  extra bytes.

  This comes at the cost of a small user space observable difference.
  For the rare case of siginfo being injected into the kernel, only what
  can be copied into kernel_siginfo is delivered to the destination; the
  rest of the bytes are set to 0.
  si_code are known this is safe, because we know those bytes are not
  used. For cases where the signal and si_code combination is unknown
  the bits that won't fit into struct kernel_siginfo are tested to
  verify they are zero, and the send fails if they are not.

  I made an extensive search through userspace code and I could not find
  anything that would break because of the above change. If it turns out
  I did break something it will take just the revert of a single change
  to restore kernel_siginfo to the same size as userspace siginfo.

  Testing did reveal dependencies on preferring the signo passed to
  sigqueueinfo over si->signo, so I bit the bullet and added the
  complexity necessary to handle that case.

  Testing also revealed bad things can happen if a negative signal
  number is passed into the system calls. Something no sane application
  will do but something a malicious program or a fuzzer might do. So I
  have fixed the code that performs the bounds checks to ensure negative
  signal numbers are handled"
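
A hedged sketch of that zero-tail rule (the helper name is hypothetical;
the real code differs in detail): for an unknown signal/si_code
combination, the bytes beyond struct kernel_siginfo must all be zero,
else the send fails.

  /* Reject an unknown siginfo whose tail bytes are not all zero. */
  static int check_siginfo_tail(const siginfo_t __user *from)
  {
          char tail[sizeof(siginfo_t) - sizeof(kernel_siginfo_t)];
          const char __user *p;

          p = (const char __user *)from + sizeof(kernel_siginfo_t);
          if (copy_from_user(tail, p, sizeof(tail)))
                  return -EFAULT;
          return memchr_inv(tail, 0, sizeof(tail)) ? -E2BIG : 0;
  }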

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
  signal: Guard against negative signal numbers in copy_siginfo_from_user32
  signal: Guard against negative signal numbers in copy_siginfo_from_user
  signal: In sigqueueinfo prefer sig not si_signo
  signal: Use a smaller struct siginfo in the kernel
  signal: Distinguish between kernel_siginfo and siginfo
  signal: Introduce copy_siginfo_from_user and use its return value
  signal: Remove the need for __ARCH_SI_PREAMBLE_SIZE and SI_PAD_SIZE
  signal: Fail sigqueueinfo if si_signo != sig
  signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
  signal/unicore32: Use force_sig_fault where appropriate
  signal/unicore32: Generate siginfo in ucs32_notify_die
  signal/unicore32: Use send_sig_fault where appropriate
  signal/arc: Use force_sig_fault where appropriate
  signal/arc: Push siginfo generation into unhandled_exception
  signal/ia64: Use force_sig_fault where appropriate
  signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
  signal/ia64: Use the generic force_sigsegv in setup_frame
  signal/arm/kvm: Use send_sig_mceerr
  signal/arm: Use send_sig_fault where appropriate
  signal/arm: Use force_sig_fault where appropriate
  ...
2018-10-24 11:22:39 +01:00
Christophe Leroy
b6ae3550c8 powerpc/8xx: add missing header in 8xx_mmu.c
arch/powerpc/mm/8xx_mmu.c:174:6: error: no previous prototype for ‘set_context’ [-Werror=missing-prototypes]
 void set_context(unsigned long id, pgd_t *pgd)

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2018-10-22 19:11:58 -05:00
Nicholas Piggin
dd76ff5af3 powerpc/64s/radix: Fix radix__flush_tlb_collapsed_pmd double flushing pmd
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Michael Ellerman
0d923962ab powerpc/mm: Fix page table dump to work on Radix
When we're running on Book3S with the Radix MMU enabled the page table
dump currently prints the wrong addresses because it uses the wrong
start address.

Fix it to use PAGE_OFFSET rather than KERN_VIRT_START.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Michael Ellerman
afb6d0647f powerpc/mm/radix: Display if mappings are exec or not
At boot we print the ranges we've mapped for the linear mapping and
what page size we've used. Also track whether the range is mapped
executable or not and display that as well.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Michael Ellerman
232aa40763 powerpc/mm/radix: Simplify split mapping logic
If we look closely at the logic in create_physical_mapping(), when
we're doing STRICT_KERNEL_RWX, we do the following steps:
  - determine the gap from where we are to the end of the range
  - choose an appropriate mapping_size based on the gap
  - check if that mapping_size would overlap the __init_begin
    boundary, and if not choose an appropriate mapping_size

We can simplify the logic by taking the __init_begin boundary into
account when we calculate the initial gap.

So add a next_boundary() function which tells us what the next
boundary is, either the __init_begin boundary or end. In future we can
add more boundaries.
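
A hedged sketch of such a next_boundary() (condensed; physical addresses
assumed, as elsewhere in this code):

  static inline phys_addr_t next_boundary(phys_addr_t addr, phys_addr_t end)
  {
  #ifdef CONFIG_STRICT_KERNEL_RWX
          /* The only boundary today is the end of kernel text. */
          if (addr < __pa_symbol(__init_begin))
                  return __pa_symbol(__init_begin);
  #endif
          return end;
  }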

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Michael Ellerman
57306c663d powerpc/mm/radix: Remove the retry in the split mapping logic
When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the
linear mapping at the text/data boundary so we can map the kernel
text read only.

The current logic uses a goto inside the for loop, which works, but is
hard to reason about.

When we hit the goto retry case we set max_mapping_size to PMD_SIZE
and go back to the start.

Setting max_mapping_size means we skip the PUD case and go to the PMD
case.

We know we will pass the alignment and gap checks because the only
reason we are there is that we hit the goto retry, and that is guarded
by mapping_size == PUD_SIZE, which means addr is PUD aligned and gap is
greater than or equal to PUD_SIZE.

So the only part of the check that can fail is the mmu_psize_defs
check for the 2M page size.

If we just duplicate that check we can avoid the goto, and we get the
same result.
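
Condensed, the resulting shape looks roughly like this (a sketch, not
the literal hunk):

  if (IS_ALIGNED(addr, PUD_SIZE) && gap >= PUD_SIZE &&
      mmu_psize_defs[MMU_PAGE_1G].shift)
          mapping_size = PUD_SIZE;
  else if (IS_ALIGNED(addr, PMD_SIZE) && gap >= PMD_SIZE &&
           mmu_psize_defs[MMU_PAGE_2M].shift)   /* the duplicated check */
          mapping_size = PMD_SIZE;
  else
          mapping_size = PAGE_SIZE;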

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Michael Ellerman
81d1b54dec powerpc/mm/radix: Fix small page at boundary when splitting
When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the
linear mapping at the text/data boundary so we can map the kernel
text read only.

Currently we always use a small page at the text/data boundary, even
when that's not necessary:

  Mapped 0x0000000000000000-0x0000000000e00000 with 2.00 MiB pages
  Mapped 0x0000000000e00000-0x0000000001000000 with 64.0 KiB pages
  Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages

This is because the check that the mapping crosses the __init_begin
boundary is too strict; it also returns true when we map exactly up to
the boundary.

So fix it to check that the mapping would actually map past
__init_begin, and with that we see:

  Mapped 0x0000000000000000-0x0000000040000000 with 2.00 MiB pages
  Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages
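
A hedged sketch of the corrected condition (physical addresses assumed;
the literal hunk may differ):

  /* Split only if the range would map PAST __init_begin, not up to it. */
  if (addr < __pa_symbol(__init_begin) &&
      addr + mapping_size > __pa_symbol(__init_begin))
          max_mapping_size = PMD_SIZE;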

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Michael Ellerman
3b5657ed5b powerpc/mm/radix: Fix overuse of small pages in splitting logic
When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the
linear mapping at the text/data boundary so we can map the kernel text
read only.

But the current logic uses small pages for the entire text section,
regardless of whether a larger page size would fit. eg. with the
boundary at 16M we could use 2M pages, but instead we use 64K pages up
to the 16M boundary:

  Mapped 0x0000000000000000-0x0000000001000000 with 64.0 KiB pages
  Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages
  Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages

This is because the test is checking if addr is < __init_begin
and addr + mapping_size is >= _stext. But that is true for all pages
between _stext and __init_begin.

Instead what we want to check is if we are crossing the text/data
boundary, which is at __init_begin. With that fixed we see:

  Mapped 0x0000000000000000-0x0000000000e00000 with 2.00 MiB pages
  Mapped 0x0000000000e00000-0x0000000001000000 with 64.0 KiB pages
  Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages
  Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages

ie. we're correctly using 2MB pages below __init_begin, but we still
drop down to 64K pages unnecessarily at the boundary.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Michael Ellerman
5c6499b704 powerpc/mm/radix: Fix off-by-one in split mapping logic
When we have CONFIG_STRICT_KERNEL_RWX enabled, we try to split the
kernel linear (1:1) mapping so that the kernel text is in a separate
page to kernel data, so we can mark the former read-only.

We could achieve that just by always using 64K pages for the linear
mapping, but we try to be smarter. Instead we use huge pages when
possible, and only switch to smaller pages when necessary.

However we have an off-by-one bug in that logic, which causes us to
calculate the wrong boundary between text and data.

For example with the end of the kernel text at 16M we see:

  radix-mmu: Mapped 0x0000000000000000-0x0000000001200000 with 64.0 KiB pages
  radix-mmu: Mapped 0x0000000001200000-0x0000000040000000 with 2.00 MiB pages
  radix-mmu: Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages

ie. we mapped from 0 to 18M with 64K pages, even though the boundary
between text and data is at 16M.

With the fix we see we're correctly hitting the 16M boundary:

  radix-mmu: Mapped 0x0000000000000000-0x0000000001000000 with 64.0 KiB pages
  radix-mmu: Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages
  radix-mmu: Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Aneesh Kumar K.V
dd0e144a63 powerpc/mm: Fix WARN_ON with THP NUMA migration
WARNING: CPU: 12 PID: 4322 at /arch/powerpc/mm/pgtable-book3s64.c:76 set_pmd_at+0x4c/0x2b0
 Modules linked in:
 CPU: 12 PID: 4322 Comm: qemu-system-ppc Tainted: G        W         4.19.0-rc3-00758-g8f0c636b0542 #36
 NIP:  c0000000000872fc LR: c000000000484eec CTR: 0000000000000000
 REGS: c000003fba876fe0 TRAP: 0700   Tainted: G        W          (4.19.0-rc3-00758-g8f0c636b0542)
 MSR:  900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 24282884  XER: 00000000
 CFAR: c000000000484ee8 IRQMASK: 0
 GPR00: c000000000484eec c000003fba877268 c000000001f0ec00 c000003fbd229f80
 GPR04: 00007c8fe8e00000 c000003f864c5a38 860300853e0000c0 0000000000000080
 GPR08: 0000000080000000 0000000000000001 0401000000000080 0000000000000001
 GPR12: 0000000000002000 c000003fffff5400 c000003fce292000 00007c9024570000
 GPR16: 0000000000000000 0000000000ffffff 0000000000000001 c000000001885950
 GPR20: 0000000000000000 001ffffc0004807c 0000000000000008 c000000001f49d05
 GPR24: 00007c8fe8e00000 c0000000020f2468 ffffffffffffffff c000003fcd33b090
 GPR28: 00007c8fe8e00000 c000003fbd229f80 c000003f864c5a38 860300853e0000c0
 NIP [c0000000000872fc] set_pmd_at+0x4c/0x2b0
 LR [c000000000484eec] do_huge_pmd_numa_page+0xb1c/0xc20
 Call Trace:
 [c000003fba877268] [c00000000045931c] mpol_misplaced+0x1bc/0x230 (unreliable)
 [c000003fba8772c8] [c000000000484eec] do_huge_pmd_numa_page+0xb1c/0xc20
 [c000003fba877398] [c00000000040d344] __handle_mm_fault+0x5e4/0x2300
 [c000003fba8774d8] [c00000000040f400] handle_mm_fault+0x3a0/0x420
 [c000003fba877528] [c0000000003ff6f4] __get_user_pages+0x2e4/0x560
 [c000003fba877628] [c000000000400314] get_user_pages_unlocked+0x104/0x2a0
 [c000003fba8776c8] [c000000000118f44] __gfn_to_pfn_memslot+0x284/0x6a0
 [c000003fba877748] [c0000000001463a0] kvmppc_book3s_radix_page_fault+0x360/0x12d0
 [c000003fba877838] [c000000000142228] kvmppc_book3s_hv_page_fault+0x48/0x1300
 [c000003fba877988] [c00000000013dc08] kvmppc_vcpu_run_hv+0x1808/0x1b50
 [c000003fba877af8] [c000000000126b44] kvmppc_vcpu_run+0x34/0x50
 [c000003fba877b18] [c000000000123268] kvm_arch_vcpu_ioctl_run+0x288/0x2d0
 [c000003fba877b98] [c00000000011253c] kvm_vcpu_ioctl+0x1fc/0x8c0
 [c000003fba877d08] [c0000000004e9b24] do_vfs_ioctl+0xa44/0xae0
 [c000003fba877db8] [c0000000004e9c44] ksys_ioctl+0x84/0xf0
 [c000003fba877e08] [c0000000004e9cd8] sys_ioctl+0x28/0x80

We removed the pte_protnone check earlier with the understanding that we
mark the pte invalid before the set_pte/set_pmd usage. But the huge pmd
autonuma path still uses set_pmd_at() directly. This is OK because a
protnone pte won't have a translation cached in the TLB.

Fixes: da7ad366b4 ("powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Christophe Leroy
37e9c674e7 powerpc/mm: fix always true/false warning in slice.c
This patch fixes the following warnings (obtained with make W=1).

arch/powerpc/mm/slice.c: In function 'slice_range_to_mask':
arch/powerpc/mm/slice.c:73:12: error: comparison is always true due to limited range of data type [-Werror=type-limits]
  if (start < SLICE_LOW_TOP) {
            ^
arch/powerpc/mm/slice.c:81:20: error: comparison is always false due to limited range of data type [-Werror=type-limits]
  if ((start + len) > SLICE_LOW_TOP) {
                    ^
arch/powerpc/mm/slice.c: In function 'slice_mask_for_free':
arch/powerpc/mm/slice.c:136:17: error: comparison is always true due to limited range of data type [-Werror=type-limits]
  if (high_limit <= SLICE_LOW_TOP)
                 ^
arch/powerpc/mm/slice.c: In function 'slice_check_range_fits':
arch/powerpc/mm/slice.c:185:12: error: comparison is always true due to limited range of data type [-Werror=type-limits]
  if (start < SLICE_LOW_TOP) {
            ^
arch/powerpc/mm/slice.c:195:39: error: comparison is always false due to limited range of data type [-Werror=type-limits]
  if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) {
                                       ^
arch/powerpc/mm/slice.c: In function 'slice_scan_available':
arch/powerpc/mm/slice.c:306:11: error: comparison is always true due to limited range of data type [-Werror=type-limits]
  if (addr < SLICE_LOW_TOP) {
           ^
arch/powerpc/mm/slice.c: In function 'get_slice_psize':
arch/powerpc/mm/slice.c:709:11: error: comparison is always true due to limited range of data type [-Werror=type-limits]
  if (addr < SLICE_LOW_TOP) {
           ^

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Christophe Leroy
aa5456abdc powerpc/mm: fix missing prototypes in slice.c
This patch fixes the following warnings (obtained with make W=1).

arch/powerpc/mm/slice.c: At top level:
arch/powerpc/mm/slice.c:682:15: error: no previous prototype for 'arch_get_unmapped_area' [-Werror=missing-prototypes]
 unsigned long arch_get_unmapped_area(struct file *filp,
               ^
arch/powerpc/mm/slice.c:692:15: error: no previous prototype for 'arch_get_unmapped_area_topdown' [-Werror=missing-prototypes]
 unsigned long arch_get_unmapped_area_topdown(struct file *filp,
               ^

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Christophe Leroy
8114c36ea6 powerpc/mm: Trace tlbia instruction
Add a trace point for tlbia (Translation Lookaside Buffer Invalidate
All) instruction.
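
A hedged sketch of the tracepoint (the argument choice, the context id,
is an assumption, mirroring the existing tlbie tracepoint):

  TRACE_EVENT(tlbia,
          TP_PROTO(unsigned long id),
          TP_ARGS(id),
          TP_STRUCT__entry(__field(unsigned long, id)),
          TP_fast_assign(__entry->id = id;),
          TP_printk("ctx.id=0x%lx", __entry->id)
  );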

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Christophe Leroy
cf4a608515 powerpc/mm: Add missing tracepoint for tlbie
commit 0428491cba ("powerpc/mm: Trace tlbie(l) instructions")
added tracepoints for tlbie calls, but _tlbil_va() was forgotten.

Fixes: 0428491cba ("powerpc/mm: Trace tlbie(l) instructions")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Christophe Leroy
3ff38e1874 powerpc/book3s64: fix dump_linuxpagetables "present" flag
Since commit bd0dbb73e0 ("powerpc/mm/books3s: Add new pte bit to
mark pte temporarily invalid."), _PAGE_PRESENT doesn't mean exactly
that a page is present. A page is also considered present when
_PAGE_INVALID is set.

This patch changes the meaning of "present" and adds a status "valid"
associated with the _PAGE_PRESENT flag.

Fixes: bd0dbb73e0 ("powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20 13:26:47 +11:00
Michael Ellerman
23ad1a2700 powerpc: Add -Werror at arch/powerpc level
Back when I added -Werror in commit ba55bd7436 ("powerpc: Add
configurable -Werror for arch/powerpc") I did it by adding it to most
of the arch Makefiles.

At the time we excluded math-emu, because apparently it didn't build
cleanly. But that seems to have been fixed somewhere in the interim.

So move the -Werror addition to the top-level of the arch. This saves
us from repeating it in every Makefile and means we won't forget to
add it to any new sub-dirs.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-19 00:56:17 +11:00
Aneesh Kumar K.V
4ffe713b75 powerpc/mm: Increase the max addressable memory to 2PB
Currently we limit the max addressable memory to 128TB. This patch
increases the limit to 2PB. We can have devices like NVDIMMs which add
memory above the 512TB limit.

We still don't support regular system RAM above 512TB. One of the
challenges with that is the percpu allocator, which allocates per-node
memory and uses the max distance between nodes as the percpu offset.
This means that with a large gap in the address space (system RAM above
1PB) we will run out of vmalloc space to map the percpu allocations.

In order to support addressable memory above 512TB, the kernel should
be able to linearly map this range. To do that with hash translation we
now add 4 contexts to the kernel linear map region. Our per-context
addressable range is 512TB. We still keep the VMALLOC and VMEMMAP
regions at their old sizes. The SLB miss handlers are updated to
validate these limits.

We also limit this update to SPARSEMEM_VMEMMAP and SPARSEMEM_EXTREME.
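
A hedged sketch of the idea (the function name is hypothetical and the
macro names approximate): each 512TB slice of the linear map region gets
its own kernel context id, so 4 contexts cover the 2PB range.

  static inline unsigned long linear_map_kernel_context(unsigned long ea)
  {
          /* 0xc000... is context 1, the next 512TB slice context 2, ... */
          return 1 + ((ea & ~REGION_MASK) >> MAX_EA_BITS_PER_CONTEXT);
  }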

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Aneesh Kumar K.V
c9f80734cd powerpc/mm/hash: Rename get_ea_context to get_user_context
We will be adding get_kernel_context later. Update the function name to
indicate that this handles context allocation for user space addresses.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Nicholas Piggin
e15a4fea4d powerpc/64s/hash: Add some SLB debugging tests
This adds CONFIG_DEBUG_VM checks to ensure:
  - The kernel stack is in the SLB after it's flushed and bolted.
  - We don't insert an SLB entry for an address that is already in the SLB.
  - The kernel SLB miss handler does not take an SLB miss.
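
A hedged illustration of the style of check (the helper here is
hypothetical; the real tests differ in detail):

  #ifdef CONFIG_DEBUG_VM
          /* After a flush and rebolt, the kernel stack must again be
           * translated by an SLB entry. */
          WARN_ON(!slb_entry_exists(get_paca()->kstack));
  #endif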

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Nicholas Piggin
94ee42727c powerpc/64s/hash: Simplify slb_flush_and_rebolt()
The name slb_flush_and_rebolt() is misleading: it is called in virtual
mode, so it cannot possibly change the stack, and therefore it should
not be touching the shadow area. And since vmalloc is no longer bolted,
it should not change any bolted mappings at all.

Change the name to slb_flush_and_restore_bolted(), and have it just
load the kernel stack from what's currently in the shadow SLB area.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Nicholas Piggin
5434ae7462 powerpc/64s/hash: Add a SLB preload cache
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.

Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
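
A hedged sketch of such a cache (structure and function names invented
for illustration): a small per-thread FIFO of recent SLB-miss ESIDs.

  #define SLB_PRELOAD_NR  16

  struct slb_preload {
          unsigned char head, nr;
          u32 esid[SLB_PRELOAD_NR];      /* user EAs fit after >> SID_SHIFT */
  };

  static void preload_add(struct slb_preload *pc, unsigned long ea)
  {
          int idx = (pc->head + pc->nr) % SLB_PRELOAD_NR;

          pc->esid[idx] = ea >> SID_SHIFT;
          if (pc->nr < SLB_PRELOAD_NR)
                  pc->nr++;
          else    /* full: the new entry replaced the oldest one */
                  pc->head = (pc->head + 1) % SLB_PRELOAD_NR;
  }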

Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.

With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security measures
disabled) increases from 140K/s to 178K/s (27%).

POWER8 does not change much (within 1%); it's unclear why it does not
see a big gain like POWER9.

Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Nicholas Piggin
425d331462 powerpc/64s/hash: Provide arch_setup_exec() hooks for hash slice setup
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Nicholas Piggin
126b11b294 powerpc/64s/hash: Add SLB allocation status bitmaps
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
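
A hedged sketch of the allocation step (field names approximate; the
round-robin fallback is left abstract):

  /* Prefer a free slot among the first 32 entries... */
  if (local_paca->slb_used_bitmap != U32_MAX) {
          index = ffz(local_paca->slb_used_bitmap);
          local_paca->slb_used_bitmap |= 1U << index;
          if (kernel)
                  local_paca->slb_kern_bitmap |= 1U << index;
  } else {
          /* ...otherwise fall back to round-robin replacement. */
          index = round_robin_next();     /* hypothetical helper */
  }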

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Nicholas Piggin
48e7b76957 powerpc/64s/hash: Convert SLB miss handlers to C
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.

This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.

Arbitrary kernel memory must not be accessed when handling kernel
space SLB misses, so care should be taken there. However user SLB
misses can access any kernel memory, which can be used to move some
fields out of the paca (in later patches).

User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except; however, that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.

[ Credit to Aneesh for bug fixes, error checks, and improvements to
  bad address handling, etc ]

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Disallow tracing for all of slb.c for now.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy
ff00552578 powerpc/8xx: change name of a few page flags to avoid confusion
_PAGE_PRIVILEGED corresponds to the SH bit, which doesn't protect
against user access but only disables ASID verification on kernel
accesses. User access is controlled with the _PMD_USER flag.

Name it _PAGE_SH instead of _PAGE_PRIVILEGED.

_PAGE_HUGE corresponds to the SPS bit, which doesn't really tell us
that it is a huge page, only that it is not a 4k page.

Name it _PAGE_SPS instead of _PAGE_HUGE.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy
97026b5a5a powerpc/mm: Split dump_pagelinuxtables flag_array table
To reduce the complexity of flag_array, and to allow the removal of
the default 0 value for non-existing flags, let's have one flag_array
table for each platform family, with only the flags that really exist.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy
26973fa5ac powerpc/mm: use pte helpers in generic code
Get rid of platform specific _PAGE_XXXX in powerpc common code and
use helpers instead.

mm/dump_linuxpagetables.c will be handled separately

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy
34eb138ed7 powerpc/mm: don't use _PAGE_EXEC for calling hash_preload()
The 'access' parameter of hash_preload() is either 0 or _PAGE_EXEC.
Of the two versions of hash_preload(), only the PPC64 one does
anything with this 'access' parameter.

In order to remove the use of _PAGE_EXEC outside platform code, the
'access' parameter is replaced by 'is_exec', which will be either true
or false, and the PPC64 version of hash_preload() creates the access
flag based on 'is_exec'.
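
A hedged sketch of the PPC64 side (the exact flag set is an assumption):

  /* Build the access flags from the boolean at the last moment. */
  unsigned long access = _PAGE_PRESENT | _PAGE_READ;

  if (is_exec)
          access |= _PAGE_EXEC;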

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy
d81e6f8b7c powerpc/mm: don't use _PAGE_EXEC in book3s/32
book3s/32 doesn't define _PAGE_EXEC, so there is no need to use it.

All other platforms define _PAGE_EXEC, so there is no need to check
that it is non-zero when not on book3s/32.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy
c766ee7223 powerpc: handover page flags with a pgprot_t parameter
In order to avoid multiple conversions, hand over a pgprot_t directly
to map_kernel_page(), as is already done for radix.

Do the same for __ioremap_caller() and __ioremap_at().
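
A hedged before/after illustration of a call site (shape approximate):

  /* before: raw pte flags, converted at each call site */
  map_kernel_page(ea, pa, pgprot_val(PAGE_KERNEL));

  /* after: the pgprot_t is handed over directly */
  map_kernel_page(ea, pa, PAGE_KERNEL);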

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy
56f3c1413f powerpc/mm: properly set PAGE_KERNEL flags in ioremap()
Set PAGE_KERNEL directly in the caller and do not rely on a
hack adding PAGE_KERNEL flags when _PAGE_PRESENT is not set.

As already done for PPC64, use pgprot_cache() helpers instead of
_PAGE_XXX flags in PPC32 ioremap() derived functions.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy
86c391bd5f powerpc/32: Add ioremap_wt() and ioremap_coherent()
Other arches have ioremap_wt() to map IO areas write-through.
Implement it on PPC as well in order to avoid drivers using
__ioremap(_PAGE_WRITETHRU).

Also implement ioremap_coherent() to avoid drivers using
__ioremap(_PAGE_COHERENT).

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-13 22:21:25 +11:00
Michael Bringmann
65b9fdadfc powerpc/pseries/mobility: Extend start/stop topology update scope
The powerpc mobility code may receive RTAS requests to perform PRRN
(Platform Resource Reassignment Notification) topology changes at any
time, including during LPAR migration operations.

In some configurations where the affinity of CPUs or memory is being
changed on that platform, the PRRN requests may apply or refer to
outdated information prior to the complete update of the device-tree.

This patch changes the duration for which topology updates are
suppressed during LPAR migrations from just the rtas_ibm_suspend_me()
/ 'ibm,suspend-me' call(s) to cover the entire migration_store()
operation to allow all changes to the device-tree to be applied prior
to accepting and applying any PRRN requests.

For tracking purposes, pr_info notices are added to the functions
start_topology_update() and stop_topology_update() of 'numa.c'.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-13 22:21:25 +11:00
Michael Ellerman
9b7e4d601b Merge branch 'fixes' into next
Merge our fixes branch. It has a few important fixes that are needed
for further testing and also some commits that will conflict with
content in next.
2018-10-09 16:51:05 +11:00
Suraj Jitindar Singh
fd10be2573 KVM: PPC: Book3S HV: Handle page fault for a nested guest
Consider a normal (L1) guest running under the main hypervisor (L0),
and then a nested guest (L2) running under the L1 guest, which is
acting as a nested hypervisor. L0 has page tables to map the address
space for L1, providing the translation from L1 real address -> L0
real address;

	L1
	|
	| (L1 -> L0)
	|
	----> L0

There are also page tables in L1 used to map the address space for L2
providing the translation from L2 real address -> L1 real address. Since
the hardware can only walk a single level of page table, we need to
maintain in L0 a "shadow_pgtable" for L2 which provides the translation
from L2 real address -> L0 real address, which looks like;

	L2				L2
	|				|
	| (L2 -> L1)			|
	|				|
	----> L1			| (L2 -> L0)
	      |				|
	      | (L1 -> L0)		|
	      |				|
	      ----> L0			--------> L0

When a page fault occurs while running a nested (L2) guest we need to
insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
do this we need to:

1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
   provide a page fault to L1 if this mapping doesn't exist.
2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
   or try to insert a pte for that mapping if it doesn't exist.
3. Now that we have an L2 -> L0 mapping, insert it into our
   shadow_pgtable (see the sketch after this list).
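
A heavily hedged pseudocode rendering of that sequence (every helper
name below is illustrative, not the real function):

  long kvmhv_handle_nested_fault(struct kvm_vcpu *vcpu,
                                 unsigned long l2_gpa)
  {
          unsigned long l1_gpa, l0_ra;

          /* 1. Walk the L1-maintained table, in L1 memory, for the
           *    L2 -> L1 translation; missing means L1 handles it. */
          if (walk_l1_pgtable(vcpu, l2_gpa, &l1_gpa))
                  return reflect_fault_to_l1(vcpu);

          /* 2. Translate L1 -> L0 with our own table, inserting a
           *    pte for that mapping first if none exists yet. */
          if (translate_or_map_l1(vcpu, l1_gpa, &l0_ra))
                  return RESUME_HOST;

          /* 3. Combine the two into an L2 -> L0 pte and insert it
           *    into the shadow_pgtable. */
          insert_shadow_pte(vcpu, l2_gpa, l0_ra);
          return RESUME_GUEST;
  }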

Once this mapping exists we can take rc faults when hardware is unable
to automatically set the reference and change bits in the pte. On these
we need to:

1. Check that the rc bits in the L2 -> L1 pte allow the update;
   otherwise reflect the fault down to L1.
2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
   host page.
3. Set the rc bits in the L2 -> L0 pte (sketched below).
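
The rc-fault side, in the same hedged pseudocode (helpers remain
illustrative):

  /* Hardware could not update referenced/changed bits itself. */
  if (!l2_to_l1_pte_rc_allows(vcpu, l2_gpa))
          return reflect_fault_to_l1(vcpu);  /* L1's pte must agree */

  set_rc_in_l1_to_l0_pte(vcpu, l1_gpa);  /* same underlying host page */
  set_rc_in_shadow_pte(vcpu, l2_gpa);    /* the L2 -> L0 pte */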

Since we reuse a large number of functions in book3s_64_mmu_radix.c
for this, a number of those functions also needed refactoring to
take an lpid parameter, so that the correct lpid is used for TLB
invalidations. The functionality itself is unchanged.
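
For example, an invalidation helper's signature gains an explicit
lpid (a sketch; kvmppc_radix_tlbie_page() stands in for the whole
set of refactored functions):

  /* Before: implicitly used the L1 guest's own lpid. */
  static void kvmppc_radix_tlbie_page(struct kvm *kvm,
                                      unsigned long addr, int psize);

  /* After: the caller chooses which (possibly nested) lpid the
   * tlb invalidation applies to. */
  static void kvmppc_radix_tlbie_page(struct kvm *kvm,
                                      unsigned long addr, int psize,
                                      unsigned int lpid);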

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Greg Kroah-Hartman
cd2093cb45 powerpc fixes for 4.19 #4
Four regression fixes.
 
 A fix for a change to lib/xz which broke our zImage loader when building with XZ
 compression. OK'ed by Herbert who merged the original patch.
 
 The recent fix we did to avoid patching __init text broke some 32-bit machines,
 fix that.
 
 Our show_user_instructions() could be tricked into printing kernel memory, add a
 check to avoid that.
 
 And a fix for a change to our NUMA initialisation logic, which causes crashes in
 some kdump configurations.
 
 Thanks to:
   Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis Roos, Murilo
   Opsfelder Araujo, Srikar Dronamraju.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlu4n8QTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgGJ+D/4nJ4ij6tb6LaIfMLxX7qCk43XFcXei
 R0wMzliZp0sC3KqVgoeGZxpXL6vAI3G46Ix+4lTUEw2LxpDpkQBRXKVnaium2Cs7
 qDSdycoWaqy5e40DRxUSno8oPtjMmDS43YnjY+/8ByA4hhF9XqzDGtF5EuCvc5Fr
 Lsts+jT3fvxVEAQM/10KnCFXDAO3gh8HBbDX3AwMRwCyxuQXjLSqDfW0+gX5ZuOC
 6LVZ2jGw1yen0mwFKPVWiJCMYeuaBn0u1LvzGb7XWHbITUBNGdx8YuOiLzAuOPYN
 OfP7+TPDsp5lbyB5KWNOVUB2ajxXUCMeT4CWZ++0HpUi1rcbJIz8ksJ9ZjA+Y4Ij
 FWOdauPdozDs6GUsvp1w9FHaD4B4KFDE7d5sKoIg73pKUPS+M/2JOIpNn/DMtg0a
 RgArGaxcswIGaJLV+SW9K8laZb085HIAFJCKuHgRPixtz2t35y+f+lTyYVg/BIpF
 lI/h60BSf7PMX0Jp1I1D8fUhmgFWcgz/rPN3J1BIG3jNjMmIU6bB+muGBlYZFDeC
 tICbYa4FAKkc8KHCALb7iF2k3BFa28HxzHU6JbDm66JNaMRD0tSpTEgX7spMr7XS
 RFySP+ofTBuzxwunsGWLjQy2rnY/DnFtbQgesDVByzTfCYhjQYqEdS38/LmWDZ6w
 pLqBpNfSgvHd0g==
 =FxNJ
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Michael writes:
  "powerpc fixes for 4.19 #4

   Four regression fixes.

   A fix for a change to lib/xz which broke our zImage loader when
   building with XZ compression. OK'ed by Herbert who merged the
   original patch.

   The recent fix we did to avoid patching __init text broke some 32-bit
   machines, fix that.

   Our show_user_instructions() could be tricked into printing kernel
   memory, add a check to avoid that.

   And a fix for a change to our NUMA initialisation logic, which causes
   crashes in some kdump configurations.

   Thanks to:
     Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis
     Roos, Murilo Opsfelder Araujo, Srikar Dronamraju."

* tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/numa: Skip onlining a offline node in kdump path
  powerpc: Don't print kernel instructions in show_user_instructions()
  powerpc/lib: fix book3s/32 boot failure due to code patching
  lib/xz: Put CRC32_POLY_LE in xz_private.h
2018-10-07 07:05:43 +02:00
Srikar Dronamraju
ac1788cc7d powerpc/numa: Skip onlining a offline node in kdump path
With commit 2ea6263068 ("powerpc/topology: Get topology for shared
processors at boot"), a kdump kernel on a shared LPAR may crash.

The necessary conditions are:
- A shared LPAR with at least 2 nodes having memory and CPUs.
- The memory requirement of the kdump kernel is met by the first N-1
  nodes, where there are at least N nodes with memory and CPUs.

Example numactl output from such a machine:
  $ numactl -H
  available: 5 nodes (0,2,5-7)
  node 0 cpus:
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 2 cpus:
  node 2 size: 255 MB
  node 2 free: 189 MB
  node 5 cpus: 24 25 26 27 28 29 30 31
  node 5 size: 4095 MB
  node 5 free: 4024 MB
  node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
  node 6 size: 6353 MB
  node 6 free: 5998 MB
  node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
  node 7 size: 7640 MB
  node 7 free: 7164 MB
  node distances:
  node   0   2   5   6   7
    0:  10  40  40  40  40
    2:  40  10  40  40  40
    5:  40  40  10  40  40
    6:  40  40  40  10  20
    7:  40  40  40  20  10

Steps to reproduce:
1. Load / start the kdump service.
2. Trigger a kdump (for example: echo c > /proc/sysrq-trigger).

When booting a kdump kernel with 2048M:

  kexec: Starting switchover sequence.
  I'm in purgatory
  Using 1TB segments
  hash-mmu: Initializing hash mmu with SLB
  Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
  Found initrd at 0xc000000009e70000:0xc00000000ae554b4
  Using pSeries machine description
  -----------------------------------------------------
  ppc64_pft_size    = 0x1e
  phys_mem_size     = 0x88000000
  dcache_bsize      = 0x80
  icache_bsize      = 0x80
  cpu_features      = 0x000000ff8f5d91a7
    possible        = 0x0000fbffcf5fb1a7
    always          = 0x0000006f8b5c91a1
  cpu_user_features = 0xdc0065c2 0xef000000
  mmu_features      = 0x7c006001
  firmware_features = 0x00000007c45bfc57
  htab_hash_mask    = 0x7fffff
  physical_start    = 0x8000000
  -----------------------------------------------------
  numa:   NODE_DATA [mem 0x87d5e300-0x87d67fff]
  numa:     NODE_DATA(0) on node 6
  numa:   NODE_DATA [mem 0x87d54600-0x87d5e2ff]
  Top of RAM: 0x88000000, Total RAM: 0x88000000
  Memory hole size: 0MB
  Zone ranges:
    DMA      [mem 0x0000000000000000-0x0000000087ffffff]
    DMA32    empty
    Normal   empty
  Movable zone start for each node
  Early memory node ranges
    node   6: [mem 0x0000000000000000-0x0000000087ffffff]
  Could not find start_pfn for node 0
  Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
  On node 0 totalpages: 0
  Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
  On node 6 totalpages: 34816

  Unable to handle kernel paging request for data at address 0x00000060
  Faulting instruction address: 0xc000000008703a54
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in:
  CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
  NIP:  c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000
  REGS: c00000000b673440 TRAP: 0380   Not tainted  (4.19.0-rc5-master+)
  MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24022022  XER: 20000002
  CFAR: c0000000086fc238 IRQMASK: 0
  GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
  GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
  GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220
  GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008
  GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
  GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
  GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
  GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
  NIP [c000000008703a54] bus_add_device+0x84/0x1e0
  LR [c000000008703a38] bus_add_device+0x68/0x1e0
  Call Trace:
  [c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
  [c00000000b673740] [c000000008700194] device_add+0x454/0x7c0
  [c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240
  [c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180
  [c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90
  [c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
  [c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580
  [c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108
  [c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c
  [c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450
  [c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160
  [c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
  Instruction dump:
  60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
  e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378
  ---[ end trace 593577668c2daa65 ]---

However a regular kernel with 4096M (2048M of which is reserved for
the crash kernel) boots properly.

Unlike regular kernels, which mark all available nodes as online,
the kdump kernel marks only as many nodes as it needs as online and
leaves the rest offline at boot. However, the kdump kernel boots
with all available CPUs. With commit 2ea6263068 ("powerpc/topology:
Get topology for shared processors at boot"), all CPUs are onlined
on their respective nodes at boot time. try_online_node() tries to
online the offline nodes but fails, as the needed subsystems are not
yet initialized.

As part of the fix, detect and skip early onlining of an offline
node, as sketched below.
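
The idea behind the fix, as a hedged sketch rather than the literal
diff (node_online() and first_online_node are the generic nodemask
helpers):

  /* In the kdump kernel most nodes are deliberately left offline,
   * and onlining one this early fails because the subsystems it
   * needs are not initialized yet; fall back to an online node. */
  if (new_nid < 0 || !node_online(new_nid))
          new_nid = first_online_node;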

Fixes: 2ea6263068 ("powerpc/topology: Get topology for shared processors at boot")
Reported-by: Pavithra Prakash <pavrampu@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-05 23:21:54 +10:00
Nicholas Piggin
053c5a753e powerpc/64s/radix: Explicitly flush ERAT with local LPID invalidation
Local radix TLB flush operations that operate on congruence classes
have explicit ERAT flushes for POWER9. The process-scoped LPID flush
did not have one, so add it.
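
A hedged reconstruction of the fixed helper (names follow the
surrounding tlb-radix code; the final asm line is the addition):

  static inline void _tlbiel_lpid_guest(unsigned long lpid,
                                        unsigned long ric)
  {
          int set;

          asm volatile("ptesync" : : : "memory");
          for (set = 0; set < POWER9_TLB_SETS_RADIX; set++)
                  __tlbiel_lpid_guest(lpid, set, ric);

          asm volatile("ptesync" : : : "memory");
          /* explicitly flush the ERAT on POWER9 */
          asm volatile(PPC_INVALIDATE_ERAT : : : "memory");
  }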

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 23:16:53 +10:00