Commit Graph

1993 Commits

Author SHA1 Message Date
Christophe Leroy
4cfac2f9c7 powerpc/mm: Simplify __set_fixmap()
__set_fixmap() uses __fix_to_virt() then does the boundary checks
by it self. Instead, we can use fix_to_virt() which does the
verification at build time. For this, we need to use it inline
so that GCC can see the real value of idx at buildtime.

In the meantime, we remove the 'fixmaps' variable.
This variable is set but has never been used from the beginning
(commit 2c419bdeca ("[POWERPC] Port fixmap from x86 and use
for kmap_atomic"))

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:58 +10:00
Christophe Leroy
86b19520e7 powerpc/mm: declare some local functions static
get_pteptr() and __mapin_ram_chunk() are only used locally,
so define them static

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:57 +10:00
Christophe Leroy
95902e6c88 powerpc/mm: Implement STRICT_KERNEL_RWX on PPC32
This patch implements STRICT_KERNEL_RWX on PPC32.

As for CONFIG_DEBUG_PAGEALLOC, it deactivates BAT and LTLB mappings
in order to allow page protection setup at the level of each page.

As BAT/LTLB mappings are deactivated, there might be a performance
impact.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:57 +10:00
Christophe Leroy
3184cc4b6f powerpc/mm: Fix kernel RAM protection after freeing unused memory on PPC32
As seen below, allthough the init sections have been freed, the
associated memory area is still marked as executable in the
page tables.

~ dmesg
[    5.860093] Freeing unused kernel memory: 592K (c0570000 - c0604000)

~ cat /sys/kernel/debug/kernel_page_tables
---[ Start of kernel VM ]---
0xc0000000-0xc0497fff        4704K  rw  X  present dirty accessed shared
0xc0498000-0xc056ffff         864K  rw     present dirty accessed shared
0xc0570000-0xc059ffff         192K  rw  X  present dirty accessed shared
0xc05a0000-0xc7ffffff      125312K  rw     present dirty accessed shared
---[ vmalloc() Area ]---

This patch fixes that.

The implementation is done by reusing the change_page_attr()
function implemented for CONFIG_DEBUG_PAGEALLOC

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:56 +10:00
Christophe Leroy
e611939fc8 powerpc/mm: Ensure change_page_attr() doesn't invalidate pinned TLBs
__change_page_attr() uses flush_tlb_page().
flush_tlb_page() uses tlbie instruction, which also invalidates
pinned TLBs, which is not what we expect.

This patch modifies the implementation to use flush_tlb_kernel_range()
instead. This will make use of tlbia which will preserve pinned TLBs.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:56 +10:00
Christophe Leroy
346bcc4d33 powerpc/8xx: mark init functions with __init
setup_initial_memory_limit() is only called during init.
mmu_patch_cmp_limit() is only called from 8xx_mmu.c

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:54 +10:00
Christophe Leroy
a3059b0ca0 powerpc/8xx: Make pinning of ITLBs optional
As stated in a comment in head_8xx.S, today we "Always pin the first
8 MB ITLB to prevent ITLB misses while mucking around with SRR0/SRR1
in asm".

This issue has just been cleared by the preceding patch, therefore
we can make this pinning optional (on by default) and independent
of DATA pinning.

This patch also makes pinning of IMMR independent of pinning of DATA.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:53 +10:00
Christophe Leroy
eef784bbe7 powerpc/8xx: Ensures RAM mapped with LTLB is seen as block mapped on 8xx.
On the 8xx, the RAM mapped with LTLBs must be seen as block mapped,
just like areas mapped with BATs on standard PPC32.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:52 +10:00
Michael Ellerman
7559952e1f powerpc/mm: Fix section mismatch warning in early_check_vec5()
early_check_vec5() is called from and calls __init routines, so should
also be __init.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 23:40:51 +10:00
Christophe Leroy
4915349b10 powerpc/8xx: Use symbolic names for DSISR bits in DSI
Use symbolic names for DSISR bits in DSI

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 23:32:20 +10:00
Christophe Leroy
968159c003 powerpc/8xx: Getting rid of remaining use of CONFIG_8xx
Two config options exist to define powerpc MPC8xx:
* CONFIG_PPC_8xx
* CONFIG_8xx

arch/powerpc/platforms/Kconfig.cputype has contained the following
comment about CONFIG_8xx item for some years:
"# this is temp to handle compat with arch=ppc"

arch/powerpc is now the only place with remaining use of
CONFIG_8xx: get rid of them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 23:32:12 +10:00
Suraj Jitindar Singh
7cd2a8695e powerpc/mm: Properly invalidate when setting process table base
The host process table base is stored in the partition table by calling
the function native_register_process_table(). Currently this just sets
the entry in memory and is missing a subsequent cache invalidation
instruction. Any update to the partition table should be followed by a
cache invalidation instruction specifying invalidation of the caching of
any partition table entries (RIC = 2, PRS = 0).

We already have a function to update the partition table with the
required cache invalidation instructions - mmu_partition_table_set_entry().
Update the native_register_process_table() function to call
mmu_partition_table_set_entry(), this ensures all appropriate
invalidation will be performed.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Use a local for patb0 to clean it up slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 22:30:03 +10:00
Michael Ellerman
21a0e8c14b powerpc/mm/hash64: Make vmalloc 56T on hash
On 64-bit book3s, with the hash MMU, we currently define the kernel
virtual space (vmalloc, ioremap etc.), to be 16T in size. This is a
leftover from pre v3.7 when our user VM was also 16T.

Of that 16T we split it 50/50, with half used for PCI IO and ioremap
and the other 8T for vmalloc.

We never bothered to make it any bigger because 8T of vmalloc ought to
be enough for anybody. But it turns out that's not true, the per cpu
allocator wants large amounts of vmalloc space, not to make large
allocations, but to allow a large stride between allocations, because
we use pcpu_embed_first_chunk().

With a bit of juggling we can increase the entire kernel virtual space
to 64T. The only real complication is the check of the address in the
SLB miss handler, see the comment in the code.

Although we could continue to split virtual space 50/50 as we do now,
no one seems to be running out of PCI IO or ioremap space. So instead
keep that as 8T, and use the remaining 56T for vmalloc.

In future we should be able to increase the kernel virtual space to
512T, the code already supports that, it just needs testing on older
hardware.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2017-08-08 19:37:05 +10:00
Michael Ellerman
b5048de04b powerpc/mm/slb: Move comment next to the code it's referring to
There is a comment in slb_allocate() referring to the load of
paca->vmalloc_sllp, but it's several lines prior in the assembly.
We're about to change this code, and we want to add another comment,
so move the comment immediately prior to the instruction it's talking
about.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08 19:37:04 +10:00
Michael Ellerman
63ee9b2ff9 powerpc/mm/book3s64: Make KERN_IO_START a variable
Currently KERN_IO_START is defined as:

 #define KERN_IO_START  (KERN_VIRT_START + (KERN_VIRT_SIZE >> 1))

Although it looks like a constant, both the components are actually
variables, to allow us to have a different value between Radix and
Hash with a single kernel.

However that still requires both Radix and Hash to place the kernel IO
region at the same location relative to the start and end of the
kernel virtual region (namely 1/2 way through it), and we'd like to
change that.

So split KERN_IO_START out into its own variable, and initialise it
for Radix and Hash. In the medium term we should be able to
reconsolidate this, by doing a more involved rearrangement of the
location of the regions.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08 19:37:04 +10:00
Benjamin Herrenschmidt
6ff4d3e966 powerpc: Remove old unused icswx based coprocessor support
We have a whole pile of unused code to maintain the ACOP register,
allocate coprocessor PIDs and handle ACOP faults. This mechanism
was used for the HFI adapter on POWER7 which is dead and gone and
whose driver never went upstream. It was used on some A2 core based
stuff that also never saw the light of day.

Take out all that code.

There is still some POWER8 coprocessor code that uses icswx but it's
kernel only and thus doesn't use any of that infrastructure.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:52 +10:00
Benjamin Herrenschmidt
8f5ca0b319 powerpc/mm: Cleanup check for stack expansion
When hitting below a VM_GROWSDOWN vma (typically growing the stack),
we check whether it's a valid stack-growing instruction and we
check the distance to GPR1. This is largely open coded with lots
of comments, so move it out to a helper.

While at it, make store_update_sp a boolean.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:51 +10:00
Benjamin Herrenschmidt
f43bb27ebf powerpc/mm: Don't lose "major" fault indication on retry
If the first iteration returns VM_FAULT_MAJOR but the second
one doesn't, we fail to account the fault as a major fault.

This fixes it and brings the code in line with x86.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:51 +10:00
Benjamin Herrenschmidt
bd0d63f809 powerpc/mm: Move page fault VMA access checks to a helper
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:51 +10:00
Benjamin Herrenschmidt
d2e0d2c51a powerpc/mm: Set fault flags earlier
Move out the code that sets FAULT_FLAG_WRITE so the block that check
access permissions can be extracted. While at it also set
FAULT_FLAG_INSTRUCTION which will be used for protection keys.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:50 +10:00
Benjamin Herrenschmidt
b15021d994 powerpc/mm: Add a bunch of (un)likely annotations to do_page_fault
Mostly for the failure cases

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:50 +10:00
Benjamin Herrenschmidt
11ccdd33d6 powerpc/mm: Move/simplify faulthandler_disabled() and !mm check
Do the check before we re-enable interrupts and clean the code
up a bit.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:49 +10:00
Benjamin Herrenschmidt
2865d08dd9 powerpc/mm: Move the DSISR_PROTFAULT sanity check
This has a page of comment explaining what's going on right in
the middle of do_page_fault() which makes things a bit hard to
follow. Move it to a helper instead. Also do the test earlier
as there's no point waiting until after we found the VMA.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:49 +10:00
Benjamin Herrenschmidt
04aafdc601 powerpc/mm: Cosmetic fix to page fault accounting
No need to break those lines, they aren't that long

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:48 +10:00
Benjamin Herrenschmidt
3da026480a powerpc/mm: Move CMO accounting out of do_page_fault into a helper
It makes do_page_fault() more readable. No functional change.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:48 +10:00
Benjamin Herrenschmidt
b5c8f0fd59 powerpc/mm: Rework mm_fault_error()
First, handle the normal retry failure in do_page_fault itself,
since it's a simple return statement. That allows us to remove
the "continue" special return code from mm_fault_error().

Once that's done, we can have an implementation much closer to
x86 where we only call mm_fault_error() if VM_FAULT_ERROR is set
and directly return.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:47 +10:00
Benjamin Herrenschmidt
c3350602e8 powerpc/mm: Make bad_area* helper functions
Instead of goto labels, instead call those functions and return.

This gets us closer to x86 and allows us to shring do_page_fault()
even more.

The main difference with x86 is that those function return a value
which we then return from do_page_fault(). That value is our
return value from do_page_fault() which we use to generate
kernel faults.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:47 +10:00
Benjamin Herrenschmidt
d3ca587404 powerpc/mm: Fix reporting of kernel execute faults
We currently test for is_exec and DSISR_PROTFAULT but that doesn't
make sense as this is the wrong error bit to test for an execute
permission failure.

In fact, we had code that would return early if we had an exec
fault in kernel mode so I think that was just dead code anyway.

Finally the location of that test is awkward and prevents further
simplifications.

So instead move that test into a helper along with the existing
early test for kernel exec faults and out of range accesses,
and put it all in a "bad_kernel_fault()" helper. While at it
test the correct error bits.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:46 +10:00
Benjamin Herrenschmidt
65d47fd4a3 powerpc/mm: Simplify returns from __do_page_fault
Now that we moved the exception state handling to a wrapper, we can
just directly return rather than "goto bail"

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:46 +10:00
Benjamin Herrenschmidt
bb4be50e61 powerpc/mm: Move debugger check to notify_page_fault()
unclutters the main path

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:46 +10:00
Benjamin Herrenschmidt
f3d96e698e powerpc/mm: Overhaul handling of bad page faults
A bad page fault is when the HW signals an error such as a bad
copy/paste, an AMO error, or some other type of error that will
not be fixed by updating the PTE.

Use a helper page_fault_is_bad() to check for bad page faults thus
removing the per-processor family open-coding in __do_page_fault()
and trigger a SIGBUS rather than a SIGSEGV which is more appropriate.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:45 +10:00
Benjamin Herrenschmidt
e6c8290a89 powerpc/mm: Move error_code checks for bad faults earlier
There's no point looking for the VMA etc.. when we already know
we are going to fail.

This adds some code to set "code" for the si_code but that will
be gone in subsequent patches.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:44 +10:00
Benjamin Herrenschmidt
41b464e5e5 powerpc/mm: Move out definition of CPU specific is_write bits
Define a common page_fault_is_write() helper and use it

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:44 +10:00
Benjamin Herrenschmidt
d300627c6a powerpc/6xx: Handle DABR match before calling do_page_fault
On legacy 6xx 32-bit procesors, we checked for the DABR match bit
in DSISR from do_page_fault(), in the middle of a pile of ifdef's
because all other CPU types do it in assembly prior to calling
do_page_fault. Fix that.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Add #ifdef CONFIG_6xx]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:26 +10:00
Benjamin Herrenschmidt
c433ec0455 powerpc/mm: Pre-filter SRR1 bits before do_page_fault()
By filtering the relevant SRR1 bits in the assembly rather than
in do_page_fault() itself, we avoid a conditional branch (since we
already come from different path for data and instruction faults).

This will allow more simplifications later

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02 13:11:07 +10:00
Benjamin Herrenschmidt
7afad422ac powerpc/mm: Move exception_enter/exit to a do_page_fault wrapper
This will allow simplifying the returns from do_page_fault

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02 13:11:07 +10:00
Benjamin Herrenschmidt
424de9c6e3 powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_range
We do that because it's used by THP pmd collapsing, so use
instead a dedicated flush function.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02 13:11:06 +10:00
Benjamin Herrenschmidt
a46cc7a90f powerpc/mm/radix: Improve TLB/PWC flushes
At the moment we have to rather sub-optimal flushing behaviours:

 - flush_tlb_mm() will flush the PWC which is unnecessary (for example
   when doing a fork)

 - A large unmap will call flush_tlb_pwc() multiple times causing us
   to perform that fairly expensive operation repeatedly. This happens
   often in batches of 3 on every new process.

So we change flush_tlb_mm() to only flush the TLB, and we use the
existing "need_flush_all" flag in struct mmu_gather to indicate
that the PWC needs flushing.

Unfortunately, flush_tlb_range() still needs to do a full flush
for now as it's used by the THP collapsing. We will fix that later.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02 13:11:06 +10:00
Benjamin Herrenschmidt
5ce5fe14ed powerpc/mm/radix: Improve _tlbiel_pid to be usable for PWC flushes
The PWC flush only needs a single set call, just like the
full (RIC=2) flush.

This will allow us to get rid of the dedicated _tlbiel_pwc()

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02 13:11:05 +10:00
Michael Ellerman
bb272221e9 Linux v4.13-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZapWhAAoJEHm+PkMAQRiGKb0IAJM6b7SbWaw69Og7+qiFB+zZ
 xp29iXqbE9fPISC6a5BRQV1ONjeDM6opGixGHqGC8Hla6k2IYz25VDNoF8wd0MXN
 cz/Ih20vd3C5afxXGe5cTT8lsPAlV0mWXxForlu6j8jPeL62FPfq6RhEkw7AcrYL
 yfYy3k3qSdOrrvBdII0WAAUi46UfIs+we9BQgbsMbkHOiqV2K0MOrzKE84Xbgepq
 RAy2xg6P4b4+hTx8xTrYc1MXwpnqjRc0oJ08gdmiwW3AOOU7LxYFn7zDkLPWi9Rr
 g4x6r4YhBTGxT4wNvovLIiqd9QFs//dMCuPWYwEtTICG48umIqqq24beQ0mvCdg=
 =08Ic
 -----END PGP SIGNATURE-----

Merge tag 'v4.13-rc1' into fixes

The fixes branch is based off a random pre-rc1 commit, because we had
some fixes that needed to go in before rc1 was released.

However we now need to fix some code that went in after that point, but
before rc1, so merge rc1 to get that code into fixes so we can fix it!
2017-07-31 20:20:29 +10:00
Rui Teng
23493c1219 powerpc/mm: Fix check of multiple 16G pages from device tree
The offset of hugepage block will not be 16G, if the expected
page is more than one. Calculate the totol size instead of the
hardcode value.

Fixes: 4792adbac9 ("powerpc: Don't use a 16G page if beyond mem= limits")
Signed-off-by: Rui Teng <rui.teng@linux.vnet.ibm.com>
Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-31 16:56:58 +10:00
Aneesh Kumar K.V
0da12a7a81 powerpc/mm/hash: Free the subpage_prot_table correctly
Fixes: dad6f37c26 ("powerpc: subpage_protect: Increase the array size to take care of 64TB")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-27 13:05:50 +10:00
Benjamin Herrenschmidt
a25bd72bad powerpc/mm/radix: Workaround prefetch issue with KVM
There's a somewhat architectural issue with Radix MMU and KVM.

When coming out of a guest with AIL (Alternate Interrupt Location, ie,
MMU enabled), we start executing hypervisor code with the PID register
still containing whatever the guest has been using.

The problem is that the CPU can (and will) then start prefetching or
speculatively load from whatever host context has that same PID (if
any), thus bringing translations for that context into the TLB, which
Linux doesn't know about.

This can cause stale translations and subsequent crashes.

Fixing this in a way that is neither racy nor a huge performance
impact is difficult. We could just make the host invalidations always
use broadcast forms but that would hurt single threaded programs for
example.

We chose to fix it instead by partitioning the PID space between guest
and host. This is possible because today Linux only use 19 out of the
20 bits of PID space, so existing guests will work if we make the host
use the top half of the 20 bits space.

We additionally add support for a property to indicate to Linux the
size of the PID register which will be useful if we eventually have
processors with a larger PID space available.

There is still an issue with malicious guests purposefully setting the
PID register to a value in the hosts PID range. Hopefully future HW
can prevent that, but in the meantime, we handle it with a pair of
kludges:

 - On the way out of a guest, before we clear the current VCPU in the
   PACA, we check the PID and if it's outside of the permitted range
   we flush the TLB for that PID.

 - When context switching, if the mm is "new" on that CPU (the
   corresponding bit was set for the first time in the mm cpumask), we
   check if any sibling thread is in KVM (has a non-NULL VCPU pointer
   in the PACA). If that is the case, we also flush the PID for that
   CPU (core).

This second part is needed to handle the case where a process is
migrated (or starts a new pthread) on a sibling thread of the CPU
coming out of KVM, as there's a window where stale translations can
exist before we detect it and flush them out.

A future optimization could be added by keeping track of whether the
PID has ever been used and avoid doing that for completely fresh PIDs.
We could similarily mark PIDs that have been the subject of a global
invalidation as "fresh". But for now this will do.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
      unneeded include of kvm_book3s_asm.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26 16:41:52 +10:00
Aneesh Kumar K.V
7e7dc66adc powerpc/mm: Build fix for non SPARSEMEM_VMEMAP config
We can use pfn_to_page() in realmode for other configs. Hence remove the
CONFIG_FLATMEM ifdef.

Fixes: 8e0861fa3c ("powerpc: Prepare to support kernel handling of IOMMU map/unmap")
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Also fix up the #endif comment]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-24 22:39:08 +10:00
Linus Torvalds
10fc95547f powerpc fixes for 4.13 #3
A handful of fixes, mostly for new code.
 
 Some reworking of the new STRICT_KERNEL_RWX support to make sure we also remove
 executable permission from __init memory before it's freed.
 
 A fix to some recent optimisations to the hypercall entry where we were
 clobbering r12, this was breaking nested guests (PR KVM).
 
 A fix for the recent patch to opal_configure_cores(). This could break booting
 on bare metal Power8 boxes if the kernel was built without
 CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG.
 
 And finally a workaround for spurious PMU interrupts on Power9 DD2.
 
 Thanks to:
   Nicholas Piggin, Anton Blanchard, Balbir Singh.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZcctRAAoJEFHr6jzI4aWAji8P/RaG3/vGmhP4HGsAh7jHx/3v
 EIBsjXPozQ16AIs4235JcQ2Vg54LLms6IaFfw1w9oYZxi98sNC3b56Rtd3AGxLbq
 WHEGAD+mE9aXUOXkycLRkKZ+THiNzR0ZQKiMDqU+oJ9XbIAoof2gI4nc7UzUxGnt
 EqxERiL2RrYqRBvO3W2ErtL29yXhopUhv+HKdbuMlIcoH4uooSdn435MPleIJWxl
 WRFFv17DFw7PN6juwlff2YWvbTgrM1ND8czLfINzSZQDy0F+HEbKDFbtHlgAvOFN
 GjbODX9OuU/QRrPQv6MoY2UumZNtLSCUsw5BUg0y4Qaak8p0gAEb4D/gmvIZNp/r
 9ZqJDPfFKh/oA08apLTPJBKQmDUwCaYHyHVJNw10CU9GZ9CzW/2/B2/h2Egpgqao
 vt0TgR3FXYizhXGAEpmBZzadAg9VKZTXH/irN6X62wMxNfvRIDOS57829QFHr0ka
 wbsOK1gI1rE1g1GwsZafyURRSe95Sv84RDHW1fFQAtvFgpA3eHl26LUkpcvwOOFI
 53g9zMedYOyZXXIdBn05QojwxgXZ0ry1Wc93oDHOn3rgDL6XBeOXFfFI5jIwPJDL
 VZhOSvN1v+Ti3Q+I05zH+ll6BMhlU8RKBf+esq419DiLUeaNLHZR6GF0+UVJyJo+
 WcLsJ1x/GDa9pOBfMEJy
 =Oy4/
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "A handful of fixes, mostly for new code:

   - some reworking of the new STRICT_KERNEL_RWX support to make sure we
     also remove executable permission from __init memory before it's
     freed.

   - a fix to some recent optimisations to the hypercall entry where we
     were clobbering r12, this was breaking nested guests (PR KVM).

   - a fix for the recent patch to opal_configure_cores(). This could
     break booting on bare metal Power8 boxes if the kernel was built
     without CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG.

   - .. and finally a workaround for spurious PMU interrupts on Power9
     DD2.

  Thanks to: Nicholas Piggin, Anton Blanchard, Balbir Singh"

* tag 'powerpc-4.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y
  powerpc/mm/hash: Refactor hash__mark_rodata_ro()
  powerpc/mm/radix: Refactor radix__mark_rodata_ro()
  powerpc/64s: Fix hypercall entry clobbering r12 input
  powerpc/perf: Avoid spurious PMU interrupts after idle
  powerpc/powernv: Fix boot on Power8 bare metal due to opal_configure_cores()
2017-07-21 13:54:37 -07:00
Michael Ellerman
029d9252b1 powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y
Currently even with STRICT_KERNEL_RWX we leave the __init text marked
executable after init, which is bad.

Add a hook to mark it NX (no-execute) before we free it, and implement
it for radix and hash.

Note that we use __init_end as the end address, not _einittext,
because overlaps_kernel_text() uses __init_end, because there are
additional executable sections other than .init.text between
__init_begin and __init_end.

Tested on radix and hash with:

  0:mon> p $__init_begin
  *** 400 exception occurred

Fixes: 1e0fc9d1eb ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18 19:54:24 +10:00
Michael Ellerman
fa7f9189e0 powerpc/mm/hash: Refactor hash__mark_rodata_ro()
Move the core logic into a helper, so we can use it for changing other
permissions.

We also change the logic to align start down, and end up. This means
calling the function with a range will expand that range to be at
least 1 mmu_linear_psize page in size. We need that so we can use it
on __init_begin ...  __init_end which is not a full page in size.

This should always work for _stext/__init_begin, because we align
__init_begin to _stext + 16M in the linker script.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18 18:51:35 +10:00
Michael Ellerman
b134bd9028 powerpc/mm/radix: Refactor radix__mark_rodata_ro()
Move the core logic into a helper, so we can use it for changing permissions
other than _PAGE_WRITE.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18 18:51:34 +10:00
Linus Torvalds
deed9deb62 powerpc fixes for 4.13 #2
Nothing that really stands out, just a bunch of fixes that have come in in the
 last couple of weeks.
 
 None of these are actually fixes for code that is new in 4.13. It's roughly half
 older bugs, with fixes going to stable, and half fixes/updates for Power9.
 
 Thanks to:
   Aneesh Kumar K.V, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
   Madhavan Srinivasan, Michael Neuling, Nicholas Piggin, Oliver O'Halloran.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZaJKuAAoJEFHr6jzI4aWAojsQAI/Mnu3T9XCPbcuWJVHiNZu2
 CubNKFb9H+HdjRZlgdT7dpx0XITCNWziQmBpQZO+w1kwa3s5uv5qNwj8I30HyTFH
 qAArMq2Yb7GIW4PR8IpbSdCEWZii4YcIiLYk3kiEMyC2UIWeznBq+j79jFyITd9T
 beevhLLq7dMoLO8T2VRSdp4YhdLKNYmgjQG1YFh3YIh8s/tyPQ0OsDaxNk5z8wgt
 /htWiVQ2X/xWRHBnVKT4olc97pXqTvtaCdBWjD/fKNZK50YIwAjr2cJVoXpJidAW
 POwwMto7rjSJDYpE6IgqQnhz0VqRTNB/4QkyaeHJkt6BPhAaafkrHZYtDgc2GaUJ
 G96J6xT9bwHn8Jz1tpsxSgSA9A2GfBhdOUKMpIUVn1l9Q1ELPA0dJzzagLhgRDRG
 w97UQ+CJqvRNfJkRHcuJbNqP5i++6yOHcHGB8QLkmkSYVcCm9Hj/fMXj/DJ5tHt/
 c9t6uw25eqI7raFcxfh09+QTX8VDV96fa+GTo1HnCH71nGG2GqCoJFXyb3HEW64s
 gD7uOPOlFlqfpuGDmV1kmdwH0R15EIJb2b6QsDssrcHEg5fA6z/xcAcV839c9y47
 U8tcBiqFF0vWgB1QljTDrKuNsjg8dIeZfkSekBGD0WuBVLSADBl7f3b2k3BZB5H3
 WxosyR4ufSVfw53HaDsb
 =6eIT
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Nothing that really stands out, just a bunch of fixes that have come
  in in the last couple of weeks.

  None of these are actually fixes for code that is new in 4.13. It's
  roughly half older bugs, with fixes going to stable, and half
  fixes/updates for Power9.

  Thanks to: Aneesh Kumar K.V, Anton Blanchard, Balbir Singh, Benjamin
  Herrenschmidt, Madhavan Srinivasan, Michael Neuling, Nicholas Piggin,
  Oliver O'Halloran"

* tag 'powerpc-4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64: Fix atomic64_inc_not_zero() to return an int
  powerpc: Fix emulation of mfocrf in emulate_step()
  powerpc: Fix emulation of mcrf in emulate_step()
  powerpc/perf: Add POWER9 alternate PM_RUN_CYC and PM_RUN_INST_CMPL events
  powerpc/perf: Fix SDAR_MODE value for continous sampling on Power9
  powerpc/asm: Mark cr0 as clobbered in mftb()
  powerpc/powernv: Fix local TLB flush for boot and MCE on POWER9
  powerpc/mm/radix: Synchronize updates to the process table
  powerpc/mm/radix: Properly clear process table entry
  powerpc/powernv: Tell OPAL about our MMU mode on POWER9
  powerpc/kexec: Fix radix to hash kexec due to IAMR/AMOR
2017-07-14 15:33:15 -07:00
Rik van Riel
0a782dc31f powerpc,mmap: properly account for stack randomization in mmap_base
When RLIMIT_STACK is, for example, 256MB, the current code results in a
gap between the top of the task and mmap_base of 256MB, failing to take
into account the amount by which the stack address was randomized.  In
other words, the stack gets less than RLIMIT_STACK space.

Ensure that the gap between the stack and mmap_base always takes stack
randomization and the stack guard gap into account.

Inspired by Daniel Micay's linux-hardened tree.

Link: http://lkml.kernel.org/r/20170622200033.25714-4-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Florian Weimer <fweimer@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Benjamin Herrenschmidt
3a6a04706f powerpc/mm/radix: Synchronize updates to the process table
When writing to the process table, we need to ensure the store is
visible to a subsequent access by the MMU. We assume we never have
the PID active while doing the update, so a ptesync/isync pair
should hopefully be a big enough hammer for our purpose.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-10 21:26:31 +10:00
Benjamin Herrenschmidt
c6bb0b8d42 powerpc/mm/radix: Properly clear process table entry
On radix, the process table entry we want to clear when destroying a
context is entry 0, not entry 1. This has no *immediate* consequence
on Power9, but it can cause other bugs to become worse.

Fixes: 7e381c0ff6 ("powerpc/mm/radix: Add mmu context handling callback for radix")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-10 21:24:34 +10:00
Linus Torvalds
d691b7e7d1 powerpc updates for 4.13
Highlights include:
 
  - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
 
  - Platform support for FSP2 (476fpe) board
 
  - Enable ZONE_DEVICE on 64-bit server CPUs.
 
  - Generic & powerpc spin loop primitives to optimise busy waiting
 
  - Convert VDSO update function to use new update_vsyscall() interface
 
  - Optimisations to hypercall/syscall/context-switch paths
 
  - Improvements to the CPU idle code on Power8 and Power9.
 
 As well as many other fixes and improvements.
 
 Thanks to:
   Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton
   Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
   Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian
   Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan,
   Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo
   Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul
   Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell,
   Thiago Jung Bauermann, Yang Li.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZXyPCAAoJEFHr6jzI4aWAI9QQAISf2x5y//cqCi4ISyQB5pTq
 KLS/yQajNkQOw7c0fzBZOaH5Xd/SJ6AcKWDg8yDlpDR3+sRRsr98iIRECgKS5I7/
 DxD9ywcbSoMXFQQo1ZMCp5CeuMUIJRtugBnUQM+JXCSUCPbznCY5DchDTLyTBTpO
 MeMVhI//JxthhoOMA9MudiEGaYCU9ho442Z4OJUSiLUv8WRbvQX9pTqoc4vx1fxA
 BWf2mflztBVcIfKIyxIIIlDLukkMzix6gSYPMCbC7lzkbnU7JSqKiheJXjo1gJS2
 ePHKDxeNR2/QP0g/j3aT/MR1uTt9MaNBSX3gANE1xQ9OoJ8m1sOtCO4gNbSdLWae
 eXhDnoiEp930DRZOeEioOItuWWoxFaMyYk3BMmRKV4QNdYL3y3TRVeL2/XmRGqYL
 Lxz4IY/x5TteFEJNGcRX90uizNSi8AaEXPF16pUk8Ctt6eH3ZSwPMv2fHeYVCMr0
 KFlKHyaPEKEoztyzLcUR6u9QB56yxDN58bvLpd32AeHvKhqyxFoySy59x9bZbatn
 B2y8mmDItg860e0tIG6jrtplpOVvL8i5jla5RWFVoQDuxxrLAds3vG9JZQs+eRzx
 Fiic93bqeUAS6RzhXbJ6QQJYIyhE7yqpcgv7ME1W87SPef3HPBk9xlp3yIDwdA2z
 bBDvrRnvupusz8qCWrxe
 =w8rj
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.

   - Platform support for FSP2 (476fpe) board

   - Enable ZONE_DEVICE on 64-bit server CPUs.

   - Generic & powerpc spin loop primitives to optimise busy waiting

   - Convert VDSO update function to use new update_vsyscall() interface

   - Optimisations to hypercall/syscall/context-switch paths

   - Improvements to the CPU idle code on Power8 and Power9.

  As well as many other fixes and improvements.

  Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman
  Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
  Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter,
  Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier
  Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown,
  Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N.
  Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek,
  Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung
  Bauermann, Yang Li"

* tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
  powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
  powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
  powerpc/mm/hash: Implement mark_rodata_ro() for hash
  powerpc/vmlinux.lds: Align __init_begin to 16M
  powerpc/lib/code-patching: Use alternate map for patch_instruction()
  powerpc/xmon: Add patch_instruction() support for xmon
  powerpc/kprobes/optprobes: Use patch_instruction()
  powerpc/kprobes: Move kprobes over to patch_instruction()
  powerpc/mm/radix: Fix execute permissions for interrupt_vectors
  powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
  powerpc/64s: Blacklist rtas entry/exit from kprobes
  powerpc/64s: Blacklist functions invoked on a trap
  powerpc/64s: Un-blacklist system_call() from kprobes
  powerpc/64s: Move system_call() symbol to just after setting MSR_EE
  powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
  powerpc/64s: Convert .L__replay_interrupt_return to a local label
  powerpc64/elfv1: Only dereference function descriptor for non-text symbols
  cxl: Export library to support IBM XSL
  powerpc/dts: Use #include "..." to include local DT
  powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
  ...
2017-07-07 13:55:45 -07:00
Punit Agrawal
7868a2087e mm/hugetlb: add size parameter to huge_pte_offset()
A poisoned or migrated hugepage is stored as a swap entry in the page
tables.  On architectures that support hugepages consisting of
contiguous page table entries (such as on arm64) this leads to ambiguity
in determining the page table entry to return in huge_pte_offset() when
a poisoned entry is encountered.

Let's remove the ambiguity by adding a size parameter to convey
additional information about the requested address.  Also fixup the
definition/usage of huge_pte_offset() throughout the tree.

Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE)
Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS)
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:34 -07:00
Aneesh Kumar K.V
40692eb5ee powerpc/mm/hugetlb: add support for 1G huge pages
POWER9 supports hugepages of size 2M and 1G in radix MMU mode.  This
patch enables the usage of 1G page size for hugetlbfs.  This also update
the helper such we can do 1G page allocation at runtime.

We still don't enable 1G page size on DD1 version.  This is to avoid
doing workaround mentioned in commit 6d3a0379eb ("powerpc/mm: Add
radix__tlb_flush_pte_p9_dd1()").

Link: http://lkml.kernel.org/r/1494995292-4443-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:33 -07:00
Aneesh Kumar K.V
28c057160e powerpc/mm/hugetlb: remove follow_huge_addr for powerpc
With generic code now handling hugetlb entries at pgd level and also
supporting hugepage directory format, we can now remove the powerpc
sepcific follow_huge_addr implementation.

Link: http://lkml.kernel.org/r/1494926612-23928-9-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:33 -07:00
Aneesh Kumar K.V
50791e6de0 powerpc/hugetlb: add follow_huge_pd implementation for ppc64
Link: http://lkml.kernel.org/r/1494926612-23928-8-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:33 -07:00
Michal Hocko
3d79a728f9 mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory
arch_add_memory gets for_device argument which then controls whether we
want to create memblocks for created memory sections.  Simplify the
logic by telling whether we want memblocks directly rather than going
through pointless negation.  This also makes the api easier to
understand because it is clear what we want rather than nothing telling
for_device which can mean anything.

This shouldn't introduce any functional change.

Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:32 -07:00
Michal Hocko
f1dd2cd13c mm, memory_hotplug: do not associate hotadded memory to zones until online
The current memory hotplug implementation relies on having all the
struct pages associate with a zone/node during the physical hotplug
phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
vast majority of cases this means that they are added to ZONE_NORMAL.
This has been so since 9d99aaa31f ("[PATCH] x86_64: Support memory
hotadd without sparsemem") and it wasn't a big deal back then because
movable onlining didn't exist yet.

Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
onlining 511c2aba8f ("mm, memory-hotplug: dynamic configure movable
memory and portion memory") and then things got more complicated.
Rather than reconsidering the zone association which was no longer
needed (because the memory hotplug already depended on SPARSEMEM) a
convoluted semantic of zone shifting has been developed.  Only the
currently last memblock or the one adjacent to the zone_movable can be
onlined movable.  This essentially means that the online type changes as
the new memblocks are added.

Let's simulate memory hot online manually
  $ echo 0x100000000 > /sys/devices/system/memory/probe
  $ grep . /sys/devices/system/memory/memory32/valid_zones
  Normal Movable

  $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
  $ grep . /sys/devices/system/memory/memory3?/valid_zones
  /sys/devices/system/memory/memory32/valid_zones:Normal
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable

  $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
  $ grep . /sys/devices/system/memory/memory3?/valid_zones
  /sys/devices/system/memory/memory32/valid_zones:Normal
  /sys/devices/system/memory/memory33/valid_zones:Normal
  /sys/devices/system/memory/memory34/valid_zones:Normal Movable

  $ echo online_movable > /sys/devices/system/memory/memory34/state
  $ grep . /sys/devices/system/memory/memory3?/valid_zones
  /sys/devices/system/memory/memory32/valid_zones:Normal
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable
  /sys/devices/system/memory/memory34/valid_zones:Movable Normal

This is an awkward semantic because an udev event is sent as soon as the
block is onlined and an udev handler might want to online it based on
some policy (e.g.  association with a node) but it will inherently race
with new blocks showing up.

This patch changes the physical online phase to not associate pages with
any zone at all.  All the pages are just marked reserved and wait for
the onlining phase to be associated with the zone as per the online
request.  There are only two requirements

	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

the latter one is not an inherent requirement and can be changed in the
future.  It preserves the current behavior and made the code slightly
simpler.  This is subject to change in future.

This means that the same physical online steps as above will lead to the
following state: Normal Movable

  /sys/devices/system/memory/memory32/valid_zones:Normal Movable
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable

  /sys/devices/system/memory/memory32/valid_zones:Normal Movable
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable
  /sys/devices/system/memory/memory34/valid_zones:Normal Movable

  /sys/devices/system/memory/memory32/valid_zones:Normal Movable
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable
  /sys/devices/system/memory/memory34/valid_zones:Movable

Implementation:
The current move_pfn_range is reimplemented to check the above
requirements (allow_online_pfn_range) and then updates the respective
zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
pfn range with the zone/node.  __add_pages is updated to not require the
zone and only initializes sections in the range.  This allowed to
simplify the arch_add_memory code (s390 could get rid of quite some of
code).

devm_memremap_pages is the only user of arch_add_memory which relies on
the zone association because it only hooks into the memory hotplug only
half way.  It uses it to associate the new memory with ZONE_DEVICE but
doesn't allow it to be {on,off}lined via sysfs.  This means that this
particular code path has to call move_pfn_range_to_zone explicitly.

The original zone shifting code is kept in place and will be removed in
the follow up patch for an easier review.

Please note that this patch also changes the original behavior when
offlining a memory block adjacent to another zone (Normal vs.  Movable)
used to allow to change its movable type.  This will be handled later.

[richard.weiyang@gmail.com: simplify zone_intersects()]
  Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
[richard.weiyang@gmail.com: remove duplicate call for set_page_links]
  Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
[akpm@linux-foundation.org: remove unused local `i']
Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:32 -07:00
Michal Hocko
1b862aecfb mm, memory_hotplug: get rid of is_zone_device_section
Device memory hotplug hooks into regular memory hotplug only half way.
It needs memory sections to track struct pages but there is no
need/desire to associate those sections with memory blocks and export
them to the userspace via sysfs because they cannot be onlined anyway.

This is currently expressed by for_device argument to arch_add_memory
which then makes sure to associate the given memory range with
ZONE_DEVICE.  register_new_memory then relies on is_zone_device_section
to distinguish special memory hotplug from the regular one.  While this
works now, later patches in this series want to move __add_zone outside
of arch_add_memory path so we have to come up with something else.

Add want_memblock down the __add_pages path and use it to control
whether the section->memblock association should be done.
arch_add_memory then just trivially want memblock for everything but
for_device hotplug.

remove_memory_section doesn't need is_zone_device_section either.  We
can simply skip all the memblock specific cleanup if there is no
memblock for the given section.

This shouldn't introduce any functional change.

Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:32 -07:00
Balbir Singh
7614ff3272 powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
The Radix linear mapping code (create_physical_mapping()) tries to use
the largest page size it can at each step. Currently the only reason
it steps down to a smaller page size is if the start addr is
unaligned (never happens in practice), or the end of memory is not
aligned to a huge page boundary.

To support STRICT_RWX we need to break the mapping at __init_begin,
so that the text and rodata prior to that can be marked R_X and the
regular pages after can be marked RW.

Having done that we can now implement mark_rodata_ro() for Radix,
knowing that we won't need to split any mappings.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Split down to PAGE_SIZE, not 2MB, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-04 11:37:39 +10:00
Balbir Singh
cd65d69713 powerpc/mm/hash: Implement mark_rodata_ro() for hash
With hash we update the bolted pte to mark it read-only. We rely
on the MMU_FTR_KERNEL_RO to generate the correct permissions
for read-only text. The radix implementation just prints a warning
in this implementation

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Make the warning louder when we don't have MMU_FTR_KERNEL_RO]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-04 11:35:16 +10:00
Linus Torvalds
9a9594efe5 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug updates from Thomas Gleixner:
 "This update is primarily a cleanup of the CPU hotplug locking code.

  The hotplug locking mechanism is an open coded RWSEM, which allows
  recursive locking. The main problem with that is the recursive nature
  as it evades the full lockdep coverage and hides potential deadlocks.

  The rework replaces the open coded RWSEM with a percpu RWSEM and
  establishes full lockdep coverage that way.

  The bulk of the changes fix up recursive locking issues and address
  the now fully reported potential deadlocks all over the place. Some of
  these deadlocks have been observed in the RT tree, but on mainline the
  probability was low enough to hide them away."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  cpu/hotplug: Constify attribute_group structures
  powerpc: Only obtain cpu_hotplug_lock if called by rtasd
  ARM/hw_breakpoint: Fix possible recursive locking for arch_hw_breakpoint_init
  cpu/hotplug: Remove unused check_for_tasks() function
  perf/core: Don't release cred_guard_mutex if not taken
  cpuhotplug: Link lock stacks for hotplug callbacks
  acpi/processor: Prevent cpu hotplug deadlock
  sched: Provide is_percpu_thread() helper
  cpu/hotplug: Convert hotplug locking to percpu rwsem
  s390: Prevent hotplug rwsem recursion
  arm: Prevent hotplug rwsem recursion
  arm64: Prevent cpu hotplug rwsem recursion
  kprobes: Cure hotplug lock ordering issues
  jump_label: Reorder hotplug lock and jump_label_lock
  perf/tracing/cpuhotplug: Fix locking order
  ACPI/processor: Use cpu_hotplug_disable() instead of get_online_cpus()
  PCI: Replace the racy recursion prevention
  PCI: Use cpu_hotplug_disable() instead of get_online_cpus()
  perf/x86/intel: Drop get_online_cpus() in intel_snb_check_microcode()
  x86/perf: Drop EXPORT of perf_check_microcode
  ...
2017-07-03 18:08:06 -07:00
Balbir Singh
7f6d498ed3 powerpc/mm/radix: Fix execute permissions for interrupt_vectors
Commit 9abcc981de ("powerpc/mm/radix: Only add X for pages
overlapping kernel text") changed the linear mapping on Radix to only
mark the kernel text executable.

However if the kernel is run relocated, for example as a kdump kernel,
then the exception vectors are split from the kernel text, ie. they
remain at real address 0.

We tend to get away with it, because the kernel itself will usually be
below 1G, which means the 1G page at 0-1G is marked executable and
everything works OK. However if the kernel is loaded above 1G, or the
system has less than 1G in total (meaning we can't use a 1G page),
then the exception vectors will not be marked executable and the
kernel will fail to boot.

Fix it by also checking if the address range overlaps the exception
vectors when deciding if we should add PAGE_KERNEL_X.

Fixes: 9abcc981de ("powerpc/mm/radix: Only add X for pages overlapping kernel text")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Combine with the existing check, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-03 23:12:19 +10:00
Michael Ellerman
218ea31039 Merge branch 'fixes' into next
Merge our fixes branch, a few of them are tripping people up while
working on top of next, and we also have a dependency between the CXL
fixes and new CXL code we want to merge into next.
2017-07-03 23:05:43 +10:00
Anton Blanchard
1b644f57b3 powerpc/mm: Wire up hpte_removebolted for powernv
Adds support for removing bolted (i.e kernel linear mapping) mappings on
powernv. This is needed to support memory hot unplug operations which
are required for the teardown of DAX/PMEM devices.

Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:28 +10:00
Oliver O'Halloran
ebd3119793 powerpc/mm: Add devmap support for ppc64
Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S.  This
is used to differentiate device backed memory from transparent huge
pages since they are handled in more or less the same manner by the core
mm code.

Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:28 +10:00
Oliver O'Halloran
b584c25440 powerpc/vmemmap: Add altmap support
Adds support to powerpc for the altmap feature of ZONE_DEVICE memory. An
altmap is a driver provided region that is used to provide the backing
storage for the struct pages of ZONE_DEVICE memory. In situations where
large amount of ZONE_DEVICE memory is being added to the system the
altmap reduces pressure on main system memory by allowing the mm/
metadata to be stored on the device itself rather in main memory.

Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:27 +10:00
Oliver O'Halloran
d7d9b612f1 powerpc/vmemmap: Reshuffle vmemmap_free()
Removes an indentation level and shuffles some code around to make the
following patch cleaner. No functional changes.

Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:26 +10:00
Oliver O'Halloran
7a849a6cf3 powerpc/hugetlbfs: Export HPAGE_SHIFT
Export it so it can be referenced inside a module.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:25 +10:00
Nicholas Piggin
4e287e655e powerpc: use spin loop primitives in some functions
Use the different spin loop primitives in some simple powerpc
spin loops, including those which will spin as a common case.

This will help to test the spin loop primitives before more
conversions are done.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add some includes of <linux/processor.h>]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:24 +10:00
Anshuman Khandual
39e4675183 powerpc/mm: Add comments on vmemmap physical mapping
Adds some explaination on how the vmemmap based struct page layout's
physical mapping is allocated and tracked through linked list. It
also keeps note of a possible race condition.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-28 13:08:17 +10:00
Anshuman Khandual
b0f36c10de powerpc/mm: Add comments to the vmemmap layout
Add some explaination to the layout of vmemmap virtual address
space and how physical page mapping is only used for valid PFNs
present at any point on the system.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-28 13:08:17 +10:00
Benjamin Herrenschmidt
74e27c6af5 powerpc: Only do ERAT invalidate on radix context switch on P9 DD1
From: Michael Neuling <mikey@neuling.org>

On P9 (Nimbus) DD2 and later, in radix mode, the move to the PID
register will implicitly invalidate the user space ERAT entries
and leave the kernel ones alone. Thus the only thing needed is
an isync() to synchronize this with subsequent uaccess's

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-27 14:15:54 +10:00
Balbir Singh
0428491cba powerpc/mm: Trace tlbie(l) instructions
Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate
Entry (Local)) instructions.

The tlbie instruction has changed over the years, so not all versions
accept the same operands. Use the ISA v3 field operands because they are
the most verbose, we may change them in future.

Example output:

  qemu-system-ppc-5371  [016]  1412.369519: tlbie:
  	tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Add some missing trace_tlbie()s, reword change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-23 21:14:49 +10:00
Thiago Jung Bauermann
3e401f7a2e powerpc: Only obtain cpu_hotplug_lock if called by rtasd
Calling arch_update_cpu_topology from a CPU hotplug state machine callback
hits a deadlock because the function tries to get a read lock on
cpu_hotplug_lock while the state machine still holds a write lock on it.

Since all callers of arch_update_cpu_topology except rtasd already hold
cpu_hotplug_lock, this patch changes the function to use
stop_machine_cpuslocked and creates a separate function for rtasd which
still tries to obtain the lock.

Michael Bringmann investigated the bug and provided a detailed analysis
of the deadlock on this previous RFC for an alternate solution:

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Michael Bringmann <mwb@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1497996510-4032-1-git-send-email-bauerman@linux.vnet.ibm.com
Link: https://patchwork.ozlabs.org/patch/771293/
2017-06-23 09:32:11 +02:00
Michael Ellerman
fd88b945c1 powerpc/64s: Rename slb_allocate_realmode() to slb_allocate()
As for slb_miss_realmode(), rename slb_allocate_realmode() to avoid
confusion over whether it runs in real or virtual mode - it runs in
both.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
2017-06-21 16:18:33 +10:00
Nicholas Piggin
d59afffdf0 powerpc/64s: Preserve r3 in slb_allocate_realmode()
One fewer registers clobbered by this function means the SLB miss
handler can save one fewer.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-20 22:18:25 +10:00
Hugh Dickins
1be7107fbe mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 21:50:20 +08:00
Michael Ellerman
9abcc981de powerpc/mm/radix: Only add X for pages overlapping kernel text
Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we
should check if the page overlaps the kernel text and only then add
PAGE_KERNEL_X.

Note that we still use 1G pages if they're available, so this will
typically still result in a 1G executable page at KERNELBASE. So this fix is
primarily useful for catching stray branches to high linear mapping addresses.

Without this patch, we can execute at 1G in xmon using:

  0:mon> m c000000040000000
  c000000040000000  00 l
  c000000040000000  00000000 01006038
  c000000040000004  00000000 2000804e
  c000000040000008  00000000 x
  0:mon> di c000000040000000
  c000000040000000  38600001      li      r3,1
  c000000040000004  4e800020      blr
  0:mon> p c000000040000000
  return value is 0x1

After we get a 400 as expected:

  0:mon> p c000000040000000
  *** 400 exception occurred

Fixes: 2bfd65e45e ("powerpc/mm/radix: Add radix callbacks for early init routines")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
2017-06-15 16:34:39 +10:00
Aneesh Kumar K.V
92d9dfda8b powerpc/mm/4k: Limit 4k page size config to 64TB virtual address space
Supporting 512TB requires us to do a order 3 allocation for level 1 page
table (pgd). This results in page allocation failures with certain workloads.
For now limit 4k linux page size config to 64TB.

Fixes: f6eedbba7a ("powerpc/mm/hash: Increase VA range to 128TB")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-08 20:42:56 +10:00
Christophe Leroy
4386c096c2 powerpc/mm: Rename map_page() to map_kernel_page() on 32-bit
These two functions implement the same semantics, so unify their naming so we
can share code that calls them. The longer name is more descriptive so use it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-05 19:59:03 +10:00
Balbir Singh
d2485644c7 powerpc/mm/hugetlb: Add support for page accounting
Add __GFP_ACCOUNT to __hugepte_alloc()

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-05 19:03:12 +10:00
Balbir Singh
abd667be15 powerpc/mm/book(e)(3s)/32: Add page table accounting
Add support in pte_alloc_one() and pgd_alloc() by
passing __GFP_ACCOUNT in the flags

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-05 19:03:11 +10:00
Balbir Singh
de3b87611d powerpc/mm/book(e)(3s)/64: Add page table accounting
Introduce a helper pgtable_gfp_flags() which
just returns the current gfp flags and adds
__GFP_ACCOUNT to account for page table allocation.
The generic helper is added to include/asm/pgalloc.h
and has two variants - WARNING ugly bits ahead

1. If the header is included from a module, no check
for mm == &init_mm is done, since init_mm is not
exported
2. For kernel includes, the check is done and required
see (3e79ec7 arch: x86: charge page tables to kmemcg)

The fundamental assumption is that no module should be
doing pgd/pud/pmd and pte alloc's on behalf of init_mm
directly.

NOTE: This adds an overhead to pmd/pud/pgd allocations
similar to x86.  The other alternative was to implement
pmd_alloc_kernel/pud_alloc_kernel and pgd_alloc_kernel
with their offset variants.

For 4k page size, pte_alloc_one no longer calls
pte_alloc_one_kernel.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-05 19:03:10 +10:00
Balbir Singh
c5cee6421c powerpc/mm/hash: Do a local flush if possible when no batch is active
Currently in hpte_need_flush() if there is no batch pending we always do a
global TLB flush, which is inefficient if the mm has never run on another
thread.

Instead do the same check that __flush_tlb_pending() does and check if a local
flush is sufficient when batch->active is false. Instead of open-coding it we
use mm_is_thread_local().

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Don't use a local, just inline mm_is_thread_local()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-05 19:02:55 +10:00
Christophe Leroy
92aa2fe039 powerpc/mm: The 8xx doesn't call do_page_fault() for breakpoints
The 8xx has a dedicated exception for breakpoints, that directly
calls do_break()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-02 19:20:12 +10:00
Christophe Leroy
da929f6af4 powerpc/mm: Evaluate user_mode(regs) only once in do_page_fault()
Analysis of the assembly code shows that when using user_mode(regs),
at least the 'andi.' is redone all the time, and also
the 'lwz ,132(r31)' most of the time. With the new form, the 'is_user'
is mapped to cr4, then all further use of is_user results in just
things like 'beq cr4,218 <do_page_fault+0x218>'

Without the patch:

  50:	81 1e 00 84 	lwz     r8,132(r30)
  54:	71 09 40 00 	andi.   r9,r8,16384
  58:	40 82 00 0c 	bne     64 <do_page_fault+0x64>

  84:	81 3e 00 84 	lwz     r9,132(r30)
  8c:	71 2a 40 00 	andi.   r10,r9,16384
  90:	41 a2 01 64 	beq     1f4 <do_page_fault+0x1f4>

  d4:	81 3e 00 84 	lwz     r9,132(r30)
  dc:	71 28 40 00 	andi.   r8,r9,16384
  e0:	41 82 02 08 	beq     2e8 <do_page_fault+0x2e8>

 108:	81 3e 00 84 	lwz     r9,132(r30)
 110:	71 28 40 00 	andi.   r8,r9,16384
 118:	41 82 02 28 	beq     340 <do_page_fault+0x340>

 1e4:	81 3e 00 84 	lwz     r9,132(r30)
 1e8:	71 2a 40 00 	andi.   r10,r9,16384
 1ec:	40 82 01 68 	bne     354 <do_page_fault+0x354>

 228:	81 3e 00 84 	lwz     r9,132(r30)
 22c:	71 28 40 00 	andi.   r8,r9,16384
 230:	41 82 ff c4 	beq     1f4 <do_page_fault+0x1f4>

 288:	71 2a 40 00 	andi.   r10,r9,16384
 294:	41 a2 fe 60 	beq     f4 <do_page_fault+0xf4>

 50c:	81 3e 00 84 	lwz     r9,132(r30)
 514:	71 2a 40 00 	andi.   r10,r9,16384
 518:	40 a2 fc e0 	bne     1f8 <do_page_fault+0x1f8>

 534:	81 3e 00 84 	lwz     r9,132(r30)
 53c:	71 2a 40 00 	andi.   r10,r9,16384
 540:	41 82 fc b8 	beq     1f8 <do_page_fault+0x1f8>

This patch creates a local var called 'is_user' which contains the
result of user_mode(regs)

With the patch:

  20:	81 03 00 84 	lwz     r8,132(r3)
  48:	55 09 97 fe 	rlwinm  r9,r8,18,31,31
  58:	2e 09 00 00 	cmpwi   cr4,r9,0
  5c:	40 92 00 0c 	bne     cr4,68 <do_page_fault+0x68>

  88:	41 b2 01 90 	beq     cr4,218 <do_page_fault+0x218>

  d4:	40 92 01 d0 	bne     cr4,2a4 <do_page_fault+0x2a4>

 120:	41 b2 00 f8 	beq     cr4,218 <do_page_fault+0x218>

 138:	41 b2 ff a0 	beq     cr4,d8 <do_page_fault+0xd8>

 1d4:	40 92 00 e0 	bne     cr4,2b4 <do_page_fault+0x2b4>

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-02 19:19:45 +10:00
Christophe Leroy
97a011e69b powerpc/mm: Remove a redundant test in do_page_fault()
The result of (trap == 0x400) is already in is_exec.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-02 19:18:34 +10:00
Christophe Leroy
e8de85ca32 powerpc/mm: Only call store_updates_sp() on stores in do_page_fault()
Function store_updates_sp() checks whether the faulting
instruction is a store updating r1. Therefore we can limit its calls
to store exceptions.

This patch is an improvement of commit a7a9dcd882 ("powerpc: Avoid
taking a data miss on every userspace instruction miss")

With the same microbenchmark app, run with 500 as argument, on an
MPC885 we get:

Before this patch: 152000 DTLB misses
After this patch:  147000 DTLB misses

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-02 19:10:24 +10:00
Christophe Leroy
9affa9e228 powerpc/mm: Remove __this_fixmap_does_not_exist()
This function has not been used since commit 9494a1e842
("powerpc: use generic fixmap.h)

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-02 19:09:53 +10:00
Balbir Singh
e63739b168 powerpc/mm/ptdump: Dump the first entry of the linear mapping as well
The check in hpte_find() should be < and not <= for PAGE_OFFSET

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-02 19:09:52 +10:00
Michael Ellerman
bfb9956ab4 powerpc/mm: Fix crash in page table dump with huge pages
The page table dump code doesn't know about huge pages, so currently
it crashes (or walks random memory, usually leading to a crash), if it
finds a huge page. On Book3S we only see huge pages in the Linux page
tables when we're using the P9 Radix MMU.

Teaching the code to properly handle huge pages is a bit more involved,
so for now just prevent the crash.

Cc: stable@vger.kernel.org # v4.10+
Fixes: 8eb07b1870 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-17 11:56:33 +10:00
Stephen Boyd
ad61dd303a scripts/spelling.txt: add regsiter -> register spelling mistake
This typo is quite common.  Fix it and add it to the spelling file so
that checkpatch catches it earlier.

Link: http://lkml.kernel.org/r/20170317011131.6881-2-sboyd@codeaurora.org
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Linus Torvalds
7246f60068 powerpc updates for 4.12 part 1.
Highlights include:
 
  - Larger virtual address space on 64-bit server CPUs. By default we use a 128TB
    virtual address space, but a process can request access to the full 512TB by
    passing a hint to mmap().
 
  - Support for the new Power9 "XIVE" interrupt controller.
 
  - TLB flushing optimisations for the radix MMU on Power9.
 
  - Support for CAPI cards on Power9, using the "Coherent Accelerator Interface
    Architecture 2.0".
 
  - The ability to configure the mmap randomisation limits at build and runtime.
 
  - Several small fixes and cleanups to the kprobes code, as well as support for
    KPROBES_ON_FTRACE.
 
  - Major improvements to handling of system reset interrupts, correctly treating
    them as NMIs, giving them a dedicated stack and using a new hypervisor call
    to trigger them, all of which should aid debugging and robustness.
 
 Many fixes and other minor enhancements.
 
 Thanks to:
   Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
   Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben
   Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian
   Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson,
   Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli,
   Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan,
   Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
   R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
   Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev
   Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler,
   Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZDHUMAAoJEFHr6jzI4aWAT7oQALkE2Nj3gjcn1z0SkFhq/1iO
 Py9Elmqm4E+L6NKYtBY5dS8xVAJ088ffzERyqJ1FY1LHkB8tn8bWRcMQmbjAFzTI
 V4TAzDNI890BN/F4ptrYRwNFxRBHAvZ4NDunTzagwYnwmTzW9PYHmOi4pvWTo3Tw
 KFUQ0joLSEgHzyfXxYB3fyj41u8N0FZvhfazdNSqia2Y5Vwwv/ION5jKplDM+09Y
 EtVEXFvaKAS1sjbM/d/Jo5rblHfR0D9/lYV10+jjyIokjzslIpyTbnj3izeYoM5V
 I4h99372zfsEjBGPPXyM3khL3zizGMSDYRmJHQSaKxjtecS9SPywPTZ8ufO/aSzV
 Ngq6nlND+f1zep29VQ0cxd3Jh40skWOXzxJaFjfDT25xa6FbfsWP2NCtk8PGylZ7
 EyqTuCWkMgIP02KlX3oHvEB2LRRPCDmRU2zECecRGNJrIQwYC2xjoiVi7Q8Qe8rY
 gr7Ib5Jj/a+uiTcCIy37+5nXq2s14/JBOKqxuYZIxeuZFvKYuRUipbKWO05WDOAz
 m/pSzeC3J8AAoYiqR0gcSOuJTOnJpGhs7zrQFqnEISbXIwLW+ICumzOmTAiBqOEY
 Rt8uW2gYkPwKLrE05445RfVUoERaAjaE06eRMOWS6slnngHmmnRJbf3PcoALiJkT
 ediqGEj0/N1HMB31V5tS
 =vSF3
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Larger virtual address space on 64-bit server CPUs. By default we
     use a 128TB virtual address space, but a process can request access
     to the full 512TB by passing a hint to mmap().

   - Support for the new Power9 "XIVE" interrupt controller.

   - TLB flushing optimisations for the radix MMU on Power9.

   - Support for CAPI cards on Power9, using the "Coherent Accelerator
     Interface Architecture 2.0".

   - The ability to configure the mmap randomisation limits at build and
     runtime.

   - Several small fixes and cleanups to the kprobes code, as well as
     support for KPROBES_ON_FTRACE.

   - Major improvements to handling of system reset interrupts,
     correctly treating them as NMIs, giving them a dedicated stack and
     using a new hypervisor call to trigger them, all of which should
     aid debugging and robustness.

   - Many fixes and other minor enhancements.

  Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
  Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
  Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
  Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
  Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
  Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
  Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
  Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
  R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
  Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
  Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
  Yang Shi"

* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
  powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
  powerpc/powernv: Fix TCE kill on NVLink2
  powerpc/mm/radix: Drop support for CPUs without lockless tlbie
  powerpc/book3s/mce: Move add_taint() later in virtual mode
  powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
  powerpc/smp: Document irq enable/disable after migrating IRQs
  powerpc/mpc52xx: Don't select user-visible RTAS_PROC
  powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
  powerpc/eeh: Clean up and document event handling functions
  powerpc/eeh: Avoid use after free in eeh_handle_special_event()
  cxl: Mask slice error interrupts after first occurrence
  cxl: Route eeh events to all drivers in cxl_pci_error_detected()
  cxl: Force context lock during EEH flow
  powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
  powerpc/xmon: Teach xmon oops about radix vectors
  powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
  powerpc/pseries: Enable VFIO
  powerpc/powernv: Fix iommu table size calculation hook for small tables
  powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
  powerpc: Add arch/powerpc/tools directory
  ...
2017-05-05 11:36:44 -07:00
Michael Ellerman
3c9ac2bcc3 powerpc/mm/radix: Drop support for CPUs without lockless tlbie
Currently the radix TLB code includes support for CPUs that do *not*
have MMU_FTR_LOCKLESS_TLBIE. On those CPUs we are required to take a
global spinlock before issuing a tlbie.

Radix can only be built for 64-bit Book3s CPUs, and of those, only
POWER4, 970, Cell and PA6T do not have MMU_FTR_LOCKLESS_TLBIE. Although
it's possible to build a kernel with Radix support that can also boot on
those CPUs, we happen to know that in reality none of those CPUs support
the Radix MMU, so the code can never actually run on those CPUs.

So remove the native_tlbie_lock in the Radix TLB code.

Note that there is another lock of the same name in the hash code, which
is unaffected by this patch.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 20:45:45 +10:00
Michael Ellerman
b13f6683ed Merge branch 'topic/ppc-kvm' into next
Merge the topic branch we were sharing with kvm-ppc, Paul has also
merged it.
2017-04-28 20:19:37 +10:00
Christophe Leroy
2505820f7c powerpc/mm: Rename table dump file name
Page table dump debugfs file is named 'kernel_page_tables' on
all other architectures implementing it, while is is named
'kernel_pagetables' on powerpc. This patch renames it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:28 +10:00
Christophe Leroy
78a18dbf01 powerpc/mm: On PPC32, display 32 bits addresses in page table dump
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:28 +10:00
Christophe Leroy
fd893fe56a powerpc/mm: Fix missing page attributes in page table dump
On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used.
There is also _PAGE_SHARED that is missing.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:27 +10:00
Christophe Leroy
6c01bbd2cf powerpc/mm: Fix page table dump build on PPC32
On PPC32 (eg. mpc885_ads_defconfig), page table dump compilation fails as
follows. This is because the memory layout is slightly different on PPC32. This
patch adapts it.

  arch/powerpc/mm/dump_linuxpagetables.c: In function 'walk_pagetables':
  arch/powerpc/mm/dump_linuxpagetables.c:369:10: error: 'KERN_VIRT_START' undeclared (first use in this function)
  ...

Fixes: 8eb07b1870 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:26 +10:00
Aneesh Kumar K.V
a5998fcb92 powerpc/mm/radix: Optimise tlbiel flush all case
_tlbiel_pid() is called with a ric (Radix Invalidation Control) argument of
either RIC_FLUSH_TLB or RIC_FLUSH_ALL.

RIC_FLUSH_ALL says to invalidate the entire TLB and the Page Walk Cache (PWC).

To flush the whole TLB, we have to iterate over each set (congruence class) of
the TLB. Currently we do that and pass RIC_FLUSH_ALL each time. That is not
incorrect but it means we flush the PWC 128 times, when once would suffice.

Fix it by doing the first flush with the ric value we're passed, and then if it
was RIC_FLUSH_ALL, we downgrade it to RIC_FLUSH_TLB, because we know we have
just flushed the PWC and don't need to do it again.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Split out of combined patch, tweak logic, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:21 +10:00
Aneesh Kumar K.V
cf4f08bed8 powerpc/mm/radix: Optimise Page Walk Cache flush
Currently we implement flushing of the page walk cache (PWC) by calling
_tlbiel_pid() with a RIC (Radix Invalidation Control) value of 1 which says to
only flush the PWC.

But _tlbiel_pid() loops over each set (congruence class) of the TLB, which is
not necessary when we're just flushing the PWC.

In fact the set argument is ignored for a PWC flush, so essentially we're just
flushing the PWC 127 extra times for no benefit.

Fix it by adding tlbiel_pwc() which just does a single flush of the PWC.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Split out of combined patch, drop _ in name, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:05 +10:00
Michael Ellerman
b409946b2a powerpc/mm: Fix possible out-of-bounds shift in arch_mmap_rnd()
The recent patch to add runtime configuration of the ASLR limits added a bug in
arch_mmap_rnd() where we may shift an integer (32-bits) by up to 33 bits,
leading to undefined behaviour.

In practice it exhibits as every process seg faulting instantly, presumably
because the rnd value hasn't been restricited by the modulus at all. We didn't
notice because it only happens under certain kernel configurations and if the
number of bits is actually set to a large value.

Fix it by switching to unsigned long.

Fixes: 9fea59bd7c ("powerpc/mm: Add support for runtime configuration of ASLR limits")
Reported-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-26 15:35:01 +10:00
Michael Ellerman
9fea59bd7c powerpc/mm: Add support for runtime configuration of ASLR limits
Add powerpc support for mmap_rnd_bits and mmap_rnd_compat_bits, which are two
sysctls that allow a user to configure the number of bits of randomness used for
ASLR.

Because of the way the Kconfig for ARCH_MMAP_RND_BITS is defined, we have to
construct at least the MIN value in Kconfig, vs in a header which would be more
natural. Given that we just go ahead and do it all in Kconfig.

At least according to the code (the documentation makes no mention of it), the
value is defined as the number of bits of randomisation *of the page*, not the
address. This makes some sense, with larger page sizes more of the low bits are
forced to zero, which would reduce the randomisation if we didn't take the
PAGE_SIZE into account. However it does mean the min/max values have to change
depending on the PAGE_SIZE in order to actually limit the amount of address
space consumed by the randomisation.

The result of that is that we have to define the default values based on both
32-bit vs 64-bit, but also the configured PAGE_SIZE. Furthermore now that we
have 128TB address space support on Book3S, we also have to take that into
account.

Finally we can wire up the value in arch_mmap_rnd().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2017-04-21 22:57:55 +10:00
Alexey Kardashevskiy
e889e96e98 powerpc/iommu: Do not call PageTransHuge() on tail pages
The CMA pages migration code does not support compound pages at
the moment so it performs few tests before proceeding to actual page
migration.

One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as
it is designed to be called on head pages only. Since we also test for
PageCompound(), and it contains PageTail() and PageHead(), we can
simplify the check by leaving just PageCompound() and therefore avoid
possible VM_BUG_ON_PAGE.

Fixes: 2e5bbb5461 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-19 20:00:20 +10:00
Aneesh Kumar K.V
321f7d29e5 powerpc/mmap: Any hint > 128TB searches the full VA space
As part of the new large address space support, processes start out life with a
128TB virtual address space. However when calling mmap() a process can pass a
hint address, and if that hint is > 128TB the kernel will use the full 512TB
address space to try and satisfy the mmap() request.

Currently we have a check that the hint is > 128TB and < 512TB (TASK_SIZE),
which was added as an optimisation to avoid updating addr_limit unnecessarily
and also to avoid calling slice_flush_segments() on all CPUs more than
necessary.

However this has the user-visible side effect that an mmap() hint above 512TB
does not search the full address space unless a preceding mmap() used a hint
value > 128TB && < 512TB.

So fix it to treat any hint above 128TB as a hint to search the full address
space, instead of checking the hint against TASK_SIZE, we instead check if the
addr_limit is already == TASK_SIZE.

This also brings the ABI in-line with what is proposed on x86. ie, that a hint
address above 128TB up to and including (2^64)-1 is an indication to search the
full address space.

Fixes: f4ea6dcb08 (powerpc/mm: Enable mappings above 128TB)
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-19 20:00:19 +10:00
Aneesh Kumar K.V
be77e999e3 powerpc/mm/radix: Use mm->task_size for boundary checking instead of addr_limit
We don't init addr_limit correctly for 32 bit applications. So default to using
mm->task_size for boundary condition checking. We use addr_limit to only control
free space search. This makes sure that we do the right thing with 32 bit
applications.

We should consolidate the usage of TASK_SIZE/mm->task_size and
mm->context.addr_limit later.

This partially reverts commit fbfef9027c (powerpc/mm: Switch some
TASK_SIZE checks to use mm_context addr_limit).

Fixes: fbfef9027c ("powerpc/mm: Switch some TASK_SIZE checks to use mm_context addr_limit")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-19 20:00:18 +10:00
Aneesh Kumar K.V
5f8122611b powerpc/mm/hash: Don't open code VMALLOC_INDEX
We have a #define for it, so use it.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-13 23:34:31 +10:00
Rashmica Gupta
9e4114b391 powerpc/mm: Fix hash table dump when memory is not contiguous
The current behaviour of the hash table dump assumes that memory is contiguous
and iterates from the start of memory to (start + size of memory). When memory
isn't physically contiguous, this doesn't work.

If memory exists at 0-5 GB and 6-10 GB then the current approach will check if
entries exist in the hash table from 0GB to 9GB. This patch changes the
behaviour to iterate over any holes up to the end of memory.

Fixes: 1515ab9321 ("powerpc/mm: Dump hash table")
Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-12 23:03:32 +10:00
Oliver O'Halloran
aaa2295292 powerpc/mm: Add physical address to Linux page table dump
The current page table dumper scans the Linux page tables and coalesces mappings
with adjacent virtual addresses and similar PTE flags. This behaviour is
somewhat broken when you consider the IOREMAP space where entirely unrelated
mappings will appear to be virtually contiguous. This patch modifies the range
coalescing so that only ranges that are both physically and virtually contiguous
are combined. This patch also adds to the dump output the physical address at
the start of each range.

Fixes: 8eb07b1870 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Print the physicall address with 0x like the other addresses]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-12 23:00:25 +10:00
Oliver O'Halloran
70538eaa70 powerpc/mm: Fix missing _PAGE_NON_IDEMPOTENT in pgtable dump
On Book3s we have two PTE flags used to mark cache-inhibited mappings:
_PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page table dumper
only looks at the generic _PAGE_NO_CACHE which is defined to be _PAGE_TOLERANT.
This patch modifies the dumper so both flags are shown in the dump.

Fixes: 8eb07b1870 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-12 22:47:27 +10:00
Anshuman Khandual
ea61455574 powerpc/mm: Remove reduntant initmem information from log
Generic core VM already prints these information in the log
buffer, hence there is no need for a second print. This just
removes the second print from arch powerpc NUMA init path.

Before the patch:

  $ dmesg | grep "Initmem"

  numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
  numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]
  numa: Initmem setup node 2 [mem 0x200000000-0x2ffffffff]
  numa: Initmem setup node 3 [mem 0x300000000-0x3ffffffff]
  numa: Initmem setup node 4 [mem 0x400000000-0x4ffffffff]
  numa: Initmem setup node 5 [mem 0x500000000-0x5ffffffff]
  numa: Initmem setup node 6 [mem 0x600000000-0x6ffffffff]
  numa: Initmem setup node 7 [mem 0x700000000-0x7ffffffff]
  Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff]
  Initmem setup node 1 [mem 0x0000000100000000-0x00000001ffffffff]
  Initmem setup node 2 [mem 0x0000000200000000-0x00000002ffffffff]
  Initmem setup node 3 [mem 0x0000000300000000-0x00000003ffffffff]
  Initmem setup node 4 [mem 0x0000000400000000-0x00000004ffffffff]
  Initmem setup node 5 [mem 0x0000000500000000-0x00000005ffffffff]
  Initmem setup node 6 [mem 0x0000000600000000-0x00000006ffffffff]
  Initmem setup node 7 [mem 0x0000000700000000-0x00000007ffffffff]

After the patch just the latter set is printed.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11 07:46:05 +10:00
Michael Ellerman
4868e3508d powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit()
setup_initial_memory_limit() is called from early_init_devtree(), which
runs prior to feature patching. If the kernel is built with CONFIG_JUMP_LABEL=y
and CONFIG_JUMP_LABEL_FEATURE_CHECKS=y then we will potentially get the
wrong value.

If we also have CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG=y we get a warning
and backtrace:

  Warning! mmu_has_feature() used prior to jump label init!
  CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc4-gccN-next-20170331-g6af2434 #1
  Call Trace:
  [c000000000fc3d50] [c000000000a26c30] .dump_stack+0xa8/0xe8 (unreliable)
  [c000000000fc3de0] [c00000000002e6b8] .setup_initial_memory_limit+0xa4/0x104
  [c000000000fc3e60] [c000000000d5c23c] .early_init_devtree+0xd0/0x2f8
  [c000000000fc3f00] [c000000000d5d3b0] .early_setup+0x90/0x11c
  [c000000000fc3f90] [c000000000000520] start_here_multiplatform+0x68/0x80

Fix it by using early_mmu_has_feature().

Fixes: c12e6f24d4 ("powerpc: Add option to use jump label for mmu_has_feature()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11 07:46:04 +10:00
Michael Ellerman
7644d5819c powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there
powerpc_debugfs_root is the dentry representing the root of the
"powerpc" directory tree in debugfs.

Currently it sits in asm/debug.h, a long with some other things that
have "debug" in the name, but are otherwise unrelated.

Pull it out into a separate header, which also includes linux/debugfs.h,
and convert all the users to include debugfs.h instead of debug.h.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11 07:46:03 +10:00
Aneesh Kumar K.V
f7327e0ba3 powerpc/mm/radix: Remove unnecessary ptesync
For a tlbiel with pid, we need to issue tlbiel with set number encoded. We
don't need to do ptesync for each of those. Instead we need one for the entire
tlbiel pid operation.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11 07:46:02 +10:00
Aneesh Kumar K.V
f6b0df55ca powerpc/mm/radix: Don't do page walk cache flush when doing full mm flush
For fullmm tlb flush, we do a flush with RIC_FLUSH_ALL which will invalidate all
related caches (radix__tlb_flush()). Hence the pwc flush is not needed.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11 07:46:02 +10:00
Frederic Barrat
88b1bf7268 powerpc/mm: Add missing global TLB invalidate if cxl is active
Commit 4c6d9acce1 ("powerpc/mm: Add hooks for cxl") converted local
TLB invalidates to global if the cxl driver is active. This is necessary
because the CAPP snoops invalidations to forward them to the PSL on the
cxl adapter. However one path was forgotten. native_flush_hash_range()
still does local TLB invalidates, as found out the hard way recently.

This patch fixes it by following the same logic as previously: if the
cxl driver is active, the local TLB invalidates are 'upgraded' to
global.

Fixes: 4c6d9acce1 ("powerpc/mm: Add hooks for cxl")
Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-05 22:13:37 +10:00
Alistair Popple
1ab66d1fba powerpc/powernv: Introduce address translation services for Nvlink2
Nvlink2 supports address translation services (ATS) allowing devices
to request address translations from an mmu known as the nest MMU
which is setup to walk the CPU page tables.

To access this functionality certain firmware calls are required to
setup and manage hardware context tables in the nvlink processing unit
(NPU). The NPU also manages forwarding of TLB invalidates (known as
address translation shootdowns/ATSDs) to attached devices.

This patch exports several methods to allow device drivers to register
a process id (PASID/PID) in the hardware tables and to receive
notification of when a device should stop issuing address translation
requests (ATRs). It also adds a fault handler to allow device drivers
to demand fault pages in.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
[mpe: Fix up comment formatting, use flush_tlb_mm()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-04 13:27:26 +10:00
Oliver O'Halloran
f6f9195ba0 powerpc/mm: Remove stale comment about the DART hole
The code to fix the problem it describes was removed in commit
c40785ad30 ("powerpc/dart: Use a cachable DART"), and it uses the
stupid comment style. Away it goooooooooooooes!

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-03 23:34:48 +10:00
Anton Blanchard
a7a9dcd882 powerpc: Avoid taking a data miss on every userspace instruction miss
Early on in do_page_fault() we call store_updates_sp(), regardless of
the type of exception. For an instruction miss this doesn't make
sense, because we only use this information to detect if a data miss
is the result of a stack expansion instruction or not.

Worse still, it results in a data miss within every userspace
instruction miss handler, because we try and load the very instruction
we are about to install a pte for!

A simple exec microbenchmark runs 6% faster on POWER8 with this fix:

 #include <stdlib.h>
 #include <stdio.h>
 #include <unistd.h>

int main(int argc, char *argv[])
{
	unsigned long left = atol(argv[1]);
	char leftstr[16];

	if (left-- == 0)
		return 0;

	sprintf(leftstr, "%ld", left);
	execlp(argv[0], argv[0], leftstr, NULL);
	perror("exec failed\n");

	return 0;
}

Pass the number of iterations on the command line (eg 10000) and time
how long it takes to execute.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-03 23:33:05 +10:00
Aneesh Kumar K.V
f4ea6dcb08 powerpc/mm: Enable mappings above 128TB
Not all user space application is ready to handle wide addresses. It's
known that at least some JIT compilers use higher bits in pointers to
encode their information. It collides with valid pointers with 512TB
addresses and leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 128TB by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 128TB.

If hint address set above 128TB, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 128TB window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

This is going to be a per mmap decision. ie, we can have some mmaps with
larger addresses and other that do not.

A sample memory layout looks like:

  10000000-10010000 r-xp 00000000 fc:00 9057045          /home/max_addr_512TB
  10010000-10020000 r--p 00000000 fc:00 9057045          /home/max_addr_512TB
  10020000-10030000 rw-p 00010000 fc:00 9057045          /home/max_addr_512TB
  10029630000-10029660000 rw-p 00000000 00:00 0          [heap]
  7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0
  7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
  7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
  7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
  7fff83690000-7fff836a0000 rw-p 00000000 00:00 0
  7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0        [vdso]
  7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
  7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
  7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
  7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0        [stack]
  1000000000000-1000000010000 rw-p 00000000 00:00 0
  1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01 21:12:29 +11:00
Aneesh Kumar K.V
fbfef9027c powerpc/mm: Switch some TASK_SIZE checks to use mm_context addr_limit
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01 21:12:28 +11:00
Aneesh Kumar K.V
82228e362f powerpc/pseries: Skip using reserved virtual address range
Now that we use all the available virtual address range, we need to make
sure we don't generate VSID such that it overlaps with the reserved vsid
range. Reserved vsid range include the virtual address range used by the
adjunct partition and also the VRMA virtual segment. We find the context
value that can result in generating such a VSID and reserve it early in
boot.

We don't look at the adjunct range, because for now we disable the
adjunct usage in a Linux LPAR via CAS interface.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rewrite hash__reserve_context_id(), move the rest into pseries]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01 21:12:27 +11:00
Aneesh Kumar K.V
bb1832217a powerpc/mm/hash: Store addr_limit in PACA
We optmize the slice page size array copy to paca by copying only the
range based on addr_limit. This will require us to not look at page size
array beyond addr_limit in PACA on slb fault. To enable that copy task
size to paca which will be used during slb fault.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rename from task_size to addr_limit, consolidate #ifdefs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01 21:12:27 +11:00
Aneesh Kumar K.V
957b778a16 powerpc/mm: Add addr_limit to mm_context and use it to derive max slice index
In the followup patch, we will increase the slice array size to handle
512TB range, but will limit the max addr to 128TB. Avoid doing
unnecessary computation and avoid doing slice mask related operation
above address limit.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01 21:12:20 +11:00
Aneesh Kumar K.V
e6f81a9201 powerpc/mm/hash: Support 68 bit VA
Inorder to support large effective address range (512TB), we want to
increase the virtual address bits to 68. But we do have platforms like
p4 and p5 that can only do 65 bit VA. We support those platforms by
limiting context bits on them to 16.

The protovsid -> vsid conversion is verified to work with both 65 and 68
bit va values. I also documented the restrictions in a table format as
part of code comments.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:10:00 +11:00
Aneesh Kumar K.V
941711a3a0 powerpc/mm/hash: Use context ids 1-4 for the kernel
Currently we use the top 4 context ids (0x7fffc-0x7ffff) for the kernel.
Kernel VSIDs are built using these top context values and effective the
segement ID. In subsequent patches we want to increase the max effective
address to 512TB. We will achieve that by increasing the effective
segment IDs there by increasing virtual address range.

We will be switching to a 68bit virtual address in the following patch.
But platforms like Power4 and Power5 only support a 65 bit virtual
address. We will handle that by limiting the context bits to 16 instead
of 19 on those platforms. That means the max context id will have a
different value on different platforms.

So that we don't have to deal with the kernel context ids changing
between different platforms, move the kernel context ids down to use
context ids 1-4.

We can't use segment 0 of context-id 0, because that maps to VSID 0,
which we want to keep as invalid, so we avoid context-id 0 entirely.
Similarly we can't use the last segment of the maximum context, so we
avoid it too.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Switch from 0-3 to 1-4 so VSID=0 remains invalid]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:59 +11:00
Michael Ellerman
760573c1a9 powerpc/mm: Split radix vs hash mm context initialisation
Complete the split of the radix vs hash mm context initialisation.

This is mostly code movement, with the exception that we now limit the
context allocation to PRTB_ENTRIES - 1 on radix.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:58 +11:00
Michael Ellerman
c1ff840d21 powerpc/mm/hash: Pull hash constants into hash__alloc_context_id()
The min and max context id values used in alloc_context_id() are
currently the right values for use on hash, and happen to also be safe
for use on radix.

But we need to change that in a subsequent patch, so make the min/max
ids parameters and pull the hash values into hsah__alloc_context_id().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:58 +11:00
Michael Ellerman
a336f2f5b0 powerpc/mm/hash: Abstract context id allocation for KVM
KVM wants to be able to allocate an MMU context id, which it does
currently by calling __init_new_context().

We're about to rework that code, so provide a wrapper for KVM so it
can not worry about the details.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:57 +11:00
Aneesh Kumar K.V
302413cad5 powerpc/mm/slice: Update slice mask printing to use bitmap printing.
We now get output like below which is much better.

[    0.935306]  good_mask low_slice: 0-15
[    0.935360]  good_mask high_slice: 0-511

Compared to

[    0.953414]  good_mask:1111111111111111 - 1111111111111.........

I also fixed an error with slice_dbg printing.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:56 +11:00
Aneesh Kumar K.V
82185222ff powerpc/mm/slice: Move slice_mask struct definition to slice.c
This structure definition need not be in a header since this is used only by
slice.c file. So move it to slice.c. This also allow us to use SLICE_NUM_HIGH
instead of 64.

I also switch the low_slices type to u64 from u16. This doesn't have an impact
on size of struct due to padding added with u16 type. This helps in using
bitmap printing function for printing slice mask.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:56 +11:00
Aneesh Kumar K.V
1b2fa5a046 powerpc/mm: Remove checks that TASK_SIZE_USER64 is too small
Remove the checks that TASK_SIZE_USER64 is smaller than H_PGTABLE_RANGE
and USER_VSID_RANGE.

In a following patch we will deliberately add support for a TASK_SIZE
smaller than both ranges, so this will no longer be an error condition.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Keep the check in pgtable_64.c that we don't exceed USER_VSID_RANGE]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:55 +11:00
Aneesh Kumar K.V
52b1e66587 powerpc/mm: Move copy_mm_to_paca to paca.c
We also update the function arg to struct mm_struct. Move this so that function
finds the definition of struct mm_struct. No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:54 +11:00
Aneesh Kumar K.V
a4d3621503 powerpc/mm/slice: Update the function prototype
This avoid copying the slice_mask struct as function return value

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:54 +11:00
Aneesh Kumar K.V
f3207c124e powerpc/mm/slice: Convert slice_mask high slice to a bitmap
In followup patch we want to increase the va range which will result
in us requiring high_slices to have more than 64 bits. To enable this
convert high_slices to bitmap. We keep the number bits same in this patch
and later change that to higher value

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Fold in fix to use bitmap_empty()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:53 +11:00
Aneesh Kumar K.V
6aa59f5162 powerpc/mm: Move hash specific pte bits to be top bits of RPN
We don't support the full 57 bits of physical address and hence can
overload the top bits of RPN as hash specific pte bits.

Add a BUILD_BUG_ON() to enforce the relationship between H_PAGE_F_SECOND
and H_PAGE_F_GIX.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Move the BUILD_BUG_ON() into hash_utils_64.c and comment it]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:52 +11:00
Aneesh Kumar K.V
a525108cf1 powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
Without this if firmware reports 1MB page size support we will crash
trying to use 1MB as hugetlb page size.

echo 300 > /sys/kernel/mm/hugepages/hugepages-1024kB/nr_hugepages

kernel BUG at ./arch/powerpc/include/asm/hugetlb.h:19!
.....
....
[c0000000e2c27b30] c00000000029dae8 .hugetlb_fault+0x638/0xda0
[c0000000e2c27c30] c00000000026fb64 .handle_mm_fault+0x844/0x1d70
[c0000000e2c27d70] c00000000004805c .do_page_fault+0x3dc/0x7c0
[c0000000e2c27e30] c00000000000ac98 handle_page_fault+0x10/0x30

With fix, we don't enable 1MB as hugepage size.

bash-4.2# cd /sys/kernel/mm/hugepages/
bash-4.2# ls
hugepages-16384kB  hugepages-16777216kB

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:50 +11:00
Aneesh Kumar K.V
ddb014b68d powerpc/mm/radix: rename _PAGE_LARGE to R_PAGE_LARGE
This bit is only used by radix and it is nice to follow the naming style of having
bit name start with H_/R_ depending on which translation mode they are used.

No functional change in this patch.

Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:49 +11:00
Aneesh Kumar K.V
98beda74de powerpc/mm/slice: Fix off-by-1 error when computing slice mask
For low slice, max addr should be less than 4G. Without limiting this correctly
we will end up with a low slice mask which has 17th bit set. This is not
a problem with the current code because our low slice mask is of type u16. But
in later patch I am switching low slice mask to u64 type and having the 17bit
set result in wrong slice mask which in turn results in mmap failures.

Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:48 +11:00
Aneesh Kumar K.V
b42279f016 powerpc/mm/nohash: MM_SLICE is only used by book3s 64
BOOKE code is dead code as per the Kconfig details. So make it simpler
by enabling MM_SLICE only for book3s_64. The changes w.r.t nohash is just
removing deadcode. W.r.t ppc64, 4k without hugetlb will now enable MM_SLICE.
But that is good, because we reduce one extra variant which probably is not
getting tested much.

Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:47 +11:00
Alexey Kardashevskiy
6b5c19c552 powerpc/mmu: Add real mode support for IOMMU preregistered memory
This makes mm_iommu_lookup() able to work in realmode by replacing
list_for_each_entry_rcu() (which can do debug stuff which can fail in
real mode) with list_for_each_entry_lockless().

This adds realmode version of mm_iommu_ua_to_hpa() which adds
explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm as in real mode
@current does not always have a correct pointer.

This adds realmode version of mm_iommu_lookup() which receives @mm
(for the same reason as for mm_iommu_preregistered()) and uses
lockless version of list_for_each_entry_rcu().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-27 15:25:48 +11:00
Ben Hutchings
3072601375 powerpc/32: Remove Mac-on-Linux/rtlinux hooks
The symbols exported for use by MOL/rtlinux aren't getting CRCs and I
was about to fix that. But MOL is dead upstream, and the latest work on
it was to make it use KVM instead of its own kernel module. So remove
them instead.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-21 22:09:26 +11:00
Laurent Dufour
819cdcdb7b powerpc/mm: Move mmap_sem unlocking in do_page_fault()
Since the fault retry is now handled earlier, we can release the
mmap_sem lock earlier too and remove later unlocking previously done in
mm_fault_error().

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-21 22:09:18 +11:00
Laurent Dufour
14c02e419a powerpc/mm: Handle VM_FAULT_RETRY earlier
In do_page_fault() if handle_mm_fault() returns VM_FAULT_RETRY, retry
the page fault handling before anything else.

This would simplify the handling of the mmap_sem lock in this part of
the code.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-21 22:09:11 +11:00
Laurent Dufour
c2294e0ffe powerpc/mm: Move mmap_sem unlock up from do_sigbus
Move mmap_sem releasing in the do_sigbus()'s unique caller : mm_fault_error()

No functional changes.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-21 22:08:53 +11:00
Paul Mackerras
fc36a90326 Revert "powerpc/64: Disable use of radix under a hypervisor"
This reverts commit 3f91a89d42.

Now that we do have the machinery for using the radix MMU under a
hypervisor, the extra check and comment introduced in 3f91a89d42 are
no longer correct.  The result is that when booted under a hypervisor
that only allows use of radix, we clear the MMU_FTR_TYPE_RADIX and
then set it again, and print a warning about ignoring the
disable_radix command line option, even though the command line does
not include "disable_radix".

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-21 16:50:28 +11:00
Linus Torvalds
f7d6a7283a powerpc fixes for 4.11 #3
Five fairly small fixes for things that went in this cycle.
 
 A fairly large patch to rework the CAS logic on Power9, necessitated by a late
 change to the firmware API, and we can't boot without it.
 
 Three fixes going to stable, allowing more instructions to be emulated on LE,
 fixing a boot crash on 32-bit Freescale BookE machines, and the OPAL XICS
 workaround.
 
 And a patch from me to sort the selects under CONFIG PPC. Annoying churn, but
 worth it in the long run, and best for it to go in now to avoid conflicts.
 
 Thanks to:
   Alexey Kardashevskiy, Anton Blanchard, Balbir Singh, Gautham R. Shenoy,
   Laurentiu Tudor, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Sachin Sant,
   Shile Zhang, Suraj Jitindar Singh.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYvqSxAAoJEFHr6jzI4aWAjMQP/06OFGz3VQvO5Q8jPsqRF22y
 Wr+04OKFmKnYVObdQk15HGOagp1fSkWWHfP/eu50kx1WNCzq7tQdLjNSi7H4F3s1
 4NwlaOfSQoxctsVtfnITJkfVScjcxK7XVagswtb3wvBpBx4lwD8fGwxkSxj6NhRw
 PNxLi44wobb8mDyR6L/6tJKBI2Jt12qXZY+kBQIleun5+lF8fNXIu4qPiglMOia6
 oPhXlp4RASt8wz74H8JuMTwGv17MxG+zvbkDPwQC7PI/fohJLybgWEfByN4H5UMy
 7Xi/lWHlShAyc7ulAIN+A1mHKY9LSv45U6qrrHFUJgRftZihoZHe6ekcI+h5oFVX
 chP9oUrQNeeZ5QqUC4rYdWwsMfiXBI0y5+BCupItixXc1LANBH9Ym9IECbgPRP93
 LQVqiS4958KijHlYBOA2zPicl/FnVO16orqakyRS0B3lQ54XBvhcgG8gIXjQr8PM
 Mt2W4r6RtGJ4ddhUPpF/W4lEuR4+dmXfEqs7DkgBKRbvi8XYkiLx2byBNh/OMRUG
 T4ILXsYf50AKRAq/jFTs9A0zkjtmtBeDdn96Mcan8i3WZuTQ7b8mQlC46zEg23A8
 XmTG2xt7N1dMjjwS78CfnvQ8sIVtA9AUfK37aTc0ICMsBCqEcWLAhHKZyCw0h25C
 wq9BMn4e5Gdg2xLTHKlL
 =SxON
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Five fairly small fixes for things that went in this cycle.

  A fairly large patch to rework the CAS logic on Power9, necessitated
  by a late change to the firmware API, and we can't boot without it.

  Three fixes going to stable, allowing more instructions to be emulated
  on LE, fixing a boot crash on 32-bit Freescale BookE machines, and the
  OPAL XICS workaround.

  And a patch from me to sort the selects under CONFIG PPC. Annoying
  churn, but worth it in the long run, and best for it to go in now to
  avoid conflicts.

  Thanks to:
    Alexey Kardashevskiy, Anton Blanchard, Balbir Singh, Gautham R.
    Shenoy, Laurentiu Tudor, Nicholas Piggin, Paul Mackerras, Ravi
    Bangoria, Sachin Sant, Shile Zhang, Suraj Jitindar Singh"

* tag 'powerpc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc: Sort the selects under CONFIG_PPC
  powerpc/64: Fix L1D cache shape vector reporting L1I values
  powerpc/64: Avoid panic during boot due to divide by zero in init_cache_info()
  powerpc: Update to new option-vector-5 format for CAS
  powerpc: Parse the command line before calling CAS
  powerpc/xics: Work around limitations of OPAL XICS priority handling
  powerpc/64: Fix checksum folding in csum_add()
  powerpc/powernv: Fix opal tracepoints with JUMP_LABEL=n
  powerpc/booke: Fix boot crash due to null hugepd
  powerpc: Fix compiling a BE kernel with a powerpc64le toolchain
  selftest/powerpc: Fix false failures for skipped tests
  powerpc/powernv: Fix bug due to labeling ambiguity in power_enter_stop
  powerpc/64: Invalidate process table caching after setting process table
  powerpc: emulate_step() tests for load/store instructions
  powerpc: Emulation support for load/store instructions on LE
2017-03-07 10:46:10 -08:00
Suraj Jitindar Singh
014d02cbf1 powerpc: Update to new option-vector-5 format for CAS
On POWER9 the ibm,client-architecture-support (CAS) negotiation process
has been updated to change how the host to guest negotiation is done for
the new hash/radix mmu as well as the nest mmu, process tables and guest
translation shootdown (GTSE).

This is documented in the unreleased PAPR ACR "CAS option vector
additions for P9".

The host tells the guest which options it supports in
ibm,arch-vec-5-platform-support. The guest then chooses a subset of these
to request in the CAS call and these are agreed to in the
ibm,architecture-vec-5 property of the chosen node.

Thus we read ibm,arch-vec-5-platform-support and make our selection before
calling CAS. We then parse the ibm,architecture-vec-5 property of the
chosen node to check whether we should run as hash or radix.

ibm,arch-vec-5-platform-support format:

index value pairs: <index, val> ... <index, val>

index: Option vector 5 byte number
val:   Some representation of supported values

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Don't print about unknown options, be consistent with OV5_FEAT]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-06 21:44:09 +11:00
Paul Mackerras
7a70d7288c powerpc/64: Invalidate process table caching after setting process table
The POWER9 MMU reads and caches entries from the process table.
When we kexec from one kernel to another, the second kernel sets
its process table pointer but doesn't currently do anything to
make the CPU invalidate any cached entries from the old process table.
This adds a tlbie (TLB invalidate entry) instruction with parameters
to invalidate caching of the process table after the new process
table is installed.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-03 11:24:50 +11:00
Ingo Molnar
589ee62844 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h>
Update code that relied on sched.h including various MM types for them.

This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar
68db0cf106 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar
010426079e sched/headers: Prepare for new header dependencies before moving more code to <linux/sched/mm.h>
We are going to split more MM APIs out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.

The APIs that we are going to move are:

  arch_pick_mmap_layout()
  arch_get_unmapped_area()
  arch_get_unmapped_area_topdown()
  mm_update_next_owner()

Include the header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Linus Torvalds
b286cedd47 powerpc updates for 4.11 part 2
Highlights include:
 
  - An update of the disassembly code used by xmon to the latest versions in
    binutils. We've received permission from all the authors of the relevant
    binutils changes to relicense their changes to the relevant files from GPLv3
    to GPLv2, for inclusion in Linux. Thanks to Peter Bergner for doing the leg
    work to get permission from everyone.
 
  - Addition of the "architected" Power9 CPU table entry, allowing us to boot
    in Power9 architected mode under a hypervisor.
 
  - Updates to the Power9 PMU code.
 
  - Implementation of clear_bit_unlock_is_negative_byte() to optimise
    unlock_page().
 
  - Freescale updates from Scott: "Highlights include 8xx breakpoints and perf,
    t1042rdb display support, and board updates."
 
 Thanks to:
   Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas Miller,
   Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan, Michael Roth, Nathan
   Fontenot, Naveen N. Rao, Nicholas Piggin, Peter Bergner, Paul E. McKenney,
   Rashmica Gupta, Russell Currey, Sahil Mehta, Stewart Smith.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYthsKAAoJEFHr6jzI4aWAaWMQAJ7mAwX98ncoYschPgRmmIun
 f6DtE4IonrxiZ22gp1ct4+c9OFtA+B5FXMcEhOKpfh93lg38PTDjHs9e5kfauD7+
 oTQ2Bg1eXaL48FKdmC5Vs4Kt+/J8e9guGafUC1OVIpTyyRPoZeUDH0lx+kSPV5bd
 PkL+wY/k3W0Njo8WgD1P9u3W15+BxISo/k8c7ajzKTHGBZlAvj5h2gO6XUBNMLyy
 YClB/qIymjZriSB+AeWYD79k8gPbBZPsmZG0ZF1hY060894LgqLB9mPOJdffx/DY
 H7/uP6jcsRDOXTOmyueW1SEmPoQbtysiMd1lNrCXKtC/Okr5uhn2cUhi88AsgWvd
 1QFly2lobcDAKPah/yB7YQGMAcmYvGGNuqrWaosaV2T7r0KprzUYYgCOqzvC3WSJ
 QtVatBzMIqRTMYq+3U4G1aHeCXlRazVQHDuvPby8RdR5b2gIexiqMab2eS7tSMIH
 mCOIunRIvT14g/7wxUV7tahN+ifncNxzAk4DvPO+Wc4FQ4sy7wArv2YipSaWRWtE
 u7tNdBkEwlDkKhJgRU5T0Op2PyMbHwCP8pWuz7PQIhKIcgwmP9wb07BIWG/GGIqn
 07TxJYX2ItabyEMZMsYhzILZqjLyiAaCARANB7ScbQbdP8wdcGZcwismhwnfROIU
 NuxsZg63BUDMoxk7Sauu
 =rspd
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull more powerpc updates from Michael Ellerman:
 "Highlights include:

   - an update of the disassembly code used by xmon to the latest
     versions in binutils. We've received permission from all the
     authors of the relevant binutils changes to relicense their changes
     to the relevant files from GPLv3 to GPLv2, for inclusion in Linux.
     Thanks to Peter Bergner for doing the leg work to get permission
     from everyone.

   - addition of the "architected" Power9 CPU table entry, allowing us
     to boot in Power9 architected mode under a hypervisor.

   - updates to the Power9 PMU code.

   - implementation of clear_bit_unlock_is_negative_byte() to optimise
     unlock_page().

   - Freescale updates from Scott: "Highlights include 8xx breakpoints
     and perf, t1042rdb display support, and board updates."

  Thanks to:
    Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas
    Miller, Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan,
    Michael Roth, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Peter
    Bergner, Paul E. McKenney, Rashmica Gupta, Russell Currey, Sahil
    Mehta, Stewart Smith"

* tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
  powerpc: Remove leftover cputime_to_nsecs call causing build error
  powerpc/mm/hash: Always clear UPRT and Host Radix bits when setting up CPU
  powerpc/optprobes: Fix TOC handling in optprobes trampoline
  powerpc/pseries: Advertise Hot Plug Event support to firmware
  cxl: fix nested locking hang during EEH hotplug
  powerpc/xmon: Dump memory in CPU endian format
  powerpc/pseries: Revert 'Auto-online hotplugged memory'
  powerpc/powernv: Make PCI non-optional
  powerpc/64: Implement clear_bit_unlock_is_negative_byte()
  powerpc/powernv: Remove unused variable in pnv_pci_sriov_disable()
  powerpc/kernel: Remove error message in pcibios_setup_phb_resources()
  powerpc/mm: Fix typo in set_pte_at()
  pci/hotplug/pnv-php: Disable MSI and PCI device properly
  pci/hotplug/pnv-php: Disable surprise hotplug capability on conflicts
  pci/hotplug/pnv-php: Remove WARN_ON() in pnv_php_put_slot()
  powerpc: Add POWER9 architected mode to cputable
  powerpc/perf: use is_kernel_addr macro in perf_get_misc_flags()
  powerpc/perf: Avoid FAB_*_MATCH checks for power9
  powerpc/perf: Add restrictions to PMC5 in power9 DD1
  powerpc/perf: Use Instruction Counter value
  ...
2017-03-01 10:10:16 -08:00
Linus Torvalds
38705613b7 powerpc updates for 4.11 part 1.
Highlights include:
 
  - Support for direct mapped LPC on POWER9, giving Linux direct access to
    devices that may be on there such as a UART.
 
  - Memory hotplug support for the Power9 Radix MMU.
 
  - Add new AUX vectors describing the processor's cache geometry, to be used by
    glibc.
 
  - The ability for a guest to ask the hypervisor to resize the guest's hash
    table, and in addition support for doing so automatically when memory is
    hotplugged into/out-of the guest. This allows the hash table to be sized
    based on the current memory usage of the guest, rather than the maximum
    possible memory usage.
 
  - Implementation of optprobes (kprobe optimisation) for powerpc.
 
 In addition there's the topic branch shared with the KVM tree, which includes
 support for guests to use the Radix MMU on Power9.
 
 Thanks to:
   Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton Blanchard,
   Benjamin Herrenschmidt, Chris Packham, Daniel Axtens, Daniel Borkmann, David
   Gibson, Finn Thain, Gautham R. Shenoy, Gavin Shan, Greg Kurz, Joel Stanley,
   John Allen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
   Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi
   Bangoria, Reza Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYrWj0AAoJEFHr6jzI4aWAsn4P/08Kz3TtOvDuuPGVNoO7fWOn
 ag5/zVt8R4FuCALqWpAZbVqMuUU4wLxG0RuWlmBNYYhrjMC6JxHpOSjQXxM2D7YT
 CdGTJxG414r6HMOeToL9i/z33o0m+KT07tscer+QMKlXVKCR2z0fEJch+zoPBHYA
 5eVBFpLLTtbNiX9UnrcM/vYz61d56kT4YJey9/8qbAkTAc1rMPa8ucU5UiKYJ7yX
 mF8cd7WE+7aqif9V8yN59G2rcbz+h3pbMw/gzImiYsYrUj4fLjU+VTKL5PPT+UFy
 WWBXD3MAMm1dksMMZi3hgoo2BZhDn3RkymeYi6Jo4kDknNMPZzkMxGyvaJ8Eq5H/
 bIYXdS1AbTtvaaEEuWqDFjpnChOEvj/8IeqitU0jjlql8BVjNKg/ESaaKucZr+pO
 pk2Mfvw0Tb/lxJT5qj27yq4aRsxJwdFOoPYCN7MquPp/wV2Tg5M6h4nVQ4T6Wl0S
 tMFQeCqXflhDWh0Xgr2UXpF66/YTj3Du5LasOTkgGeU30Z8TcNGFEmDWShKP3cEm
 e0oQE+OHhIGN4WSBXBAto/Gw8/0v3pXlMs+VEfeHqenwPss2sWtSwXWe8khmiy9e
 DJ48sTzj75/Zx1fiqRldw9YEnrL+NK0eOOpvzxeyKpfvdUytc+chFfEqUmO6kl3Z
 DW2UmlZxmW+b0SfexCHL
 =Icle
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for direct mapped LPC on POWER9, giving Linux direct access
     to devices that may be on there such as a UART.

   - Memory hotplug support for the Power9 Radix MMU.

   - Add new AUX vectors describing the processor's cache geometry, to
     be used by glibc.

   - The ability for a guest to ask the hypervisor to resize the guest's
     hash table, and in addition support for doing so automatically when
     memory is hotplugged into/out-of the guest. This allows the hash
     table to be sized based on the current memory usage of the guest,
     rather than the maximum possible memory usage.

   - Implementation of optprobes (kprobe optimisation) for powerpc.

  In addition there's the topic branch shared with the KVM tree, which
  includes support for guests to use the Radix MMU on Power9.

  Thanks to:
    Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton
    Blanchard, Benjamin Herrenschmidt, Chris Packham, Daniel Axtens,
    Daniel Borkmann, David Gibson, Finn Thain, Gautham R. Shenoy, Gavin
    Shan, Greg Kurz, Joel Stanley, John Allen, Madhavan Srinivasan,
    Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Nathan Fontenot,
    Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Reza
    Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun"

* tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (129 commits)
  powerpc/mm/radix: Skip ptesync in pte update helpers
  powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mm
  powerpc/mm/radix: Update pte update sequence for pte clear case
  powerpc/mm: Update PROTFAULT handling in the page fault path
  powerpc/xmon: Fix data-breakpoint
  powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=y
  powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y
  powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n
  powerpc/pseries: Fix typo in parameter description
  powerpc/kprobes: Remove kprobe_exceptions_notify()
  kprobes: Introduce weak variant of kprobe_exceptions_notify()
  powerpc/ftrace: Fix confusing help text for DISABLE_MPROFILE_KERNEL
  powerpc/powernv: Fix opal_exit tracepoint opcode
  powerpc: Add a prototype for mcount() so it can be versioned
  powerpc: Drop GPL from of_node_to_nid() export to match other arches
  powerpc/kprobes: Optimize kprobe in kretprobe_trampoline()
  powerpc/kprobes: Implement Optprobes
  powerpc/kprobes: Fixes for kprobe_lookup_name() on BE
  powerpc: Add helper to check if offset is within relative branch range
  powerpc/bpf: Introduce __PPC_SH64()
  ...
2017-02-22 10:30:38 -08:00
Gavin Shan
c618f6b188 powerpc/mm: Fix typo in set_pte_at()
This fixes the typo about the _PAGE_PTE in set_pte_at() by changing
"tryint" to "trying to".

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-17 22:16:25 +11:00
Michael Ellerman
a90e883d8b powerpc/mm: Blacklist SLB symbols from kprobe
We can't sensibly take a trap at this point. So, blacklist these
symbols.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-17 10:58:52 +11:00
Michael Ellerman
e471c393df powerpc/mm: Convert slb_finish_load[_1T] to local symbols
slb_finish_load and slb_finish_load_1T are both only used within
slb_low.S, so make them local symbols.

This makes the code a little clearer, as it's more obvious neither is
intended to be an entry point from arbitrary other code, only the uses
in this file.

It also prevents them being used with kprobes and other tracing tools,
which is good because we're not able to safely take traps at these
locations, so making them local symbols avoids us needing to blacklist
them.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-17 10:58:51 +11:00
Paul Mackerras
3f91a89d42 powerpc/64: Disable use of radix under a hypervisor
Currently, if the kernel is running on a POWER9 processor under a
hypervisor, it may try to use the radix MMU even though it doesn't have
the necessary code to do so (it doesn't negotiate use of radix, and it
doesn't do the H_REGISTER_PROC_TBL hcall).  If the hypervisor supports
both radix and HPT, then it will set up the guest to use HPT (since the
guest doesn't request radix in the CAS call), but if the radix feature
bit is set in the ibm,pa-features property (which is valid, since
ibm,pa-features is defined to represent the capabilities of the
processor) the guest will try to use radix, resulting in a crash when
it turns the MMU on.

This makes the minimal fix for the current code, which is to disable
radix unless we are running in hypervisor mode.

Fixes: 2bfd65e45e ("powerpc/mm/radix: Add radix callbacks for early init routines")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-16 14:01:42 +11:00
Aneesh Kumar K.V
18061c17c8 powerpc/mm: Update PROTFAULT handling in the page fault path
With radix, we can get page fault with DSISR_PROTFAULT value set in case of
PROT_NONE or autonuma mapping. The PROT_NONE case in handled by the vma check
where we consider the access bad. For autonuma we should fall through and fixup
the access mask correctly.

Without this patch we trigger the WARN_ON() on radix. This code moves that
WARN_ON() within a radix_enabled() check. I also moved the WARN_ON() outside
the if condition making it apply for all type of faults (exec/write/read). It
is also conditionalized for book3s, because BOOK3E can also get a PROTFAULT to
handle the D/I cache sync.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-15 20:02:39 +11:00
Michael Ellerman
da0e7e6276 Merge branch 'topic/ppc-kvm' into next
Merge the topic branch we're sharing with the kvm-ppc tree.
2017-02-14 17:18:29 +11:00
Michael Ellerman
a05ef161cd powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y
Currently the build breaks if CMA=n and SPAPR_TCE_IOMMU=y:

  arch/powerpc/mm/mmu_context_iommu.c: In function ‘mm_iommu_get’:
  arch/powerpc/mm/mmu_context_iommu.c:193:42: error: ‘MIGRATE_CMA’ undeclared (first use in this function)
  if (get_pageblock_migratetype(page) == MIGRATE_CMA) {
  ^~~~~~~~~~~

Fix it by using the existing is_migrate_cma_page(), which evaulates to
false when CMA=n.

Fixes: 2e5bbb5461 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-14 16:54:22 +11:00
Shailendra Singh
be9ba9ff93 powerpc: Drop GPL from of_node_to_nid() export to match other arches
The generic implementation of of_node_to_nid() is EXPORT_SYMBOL, added
in commit 298535c00a ("of, numa: Add NUMA of binding
implementation.").

The powerpc implementation added in commit 953039c8df ("[PATCH]
powerpc: Allow devices to register with numa topology") is
EXPORT_SYMBOL_GPL.

This creates an inconsistency for of_node_to_nid() callers across
architectures.

Update the powerpc implementation to be exported consistently with the
generic implementation.

Signed-off-by: Shailendra Singh <shailendras@nvidia.com>
Reviewed-by: Andy Ritger <aritger@nvidia.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10 13:28:05 +11:00
David Gibson
438cc81a41 powerpc/pseries: Automatically resize HPT for memory hot add/remove
We've now implemented code in the pseries platform to use the new PAPR
interface to allow resizing the hash page table (HPT) at runtime.

This patch uses that interface to automatically attempt to resize the HPT
when memory is hot added or removed.  This tries to always keep the HPT at
a reasonable size for our current memory size.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10 13:28:02 +11:00
David Gibson
dbcf929c00 powerpc/pseries: Add support for hash table resizing
This adds support for using two hypercalls to change the size of the
main hash page table while running as a PAPR guest. For now these
hypercalls are only in experimental qemu versions.

The interface is two part: first H_RESIZE_HPT_PREPARE is used to
allocate and prepare the new hash table. This may be slow, but can be
done asynchronously. Then, H_RESIZE_HPT_COMMIT is used to switch to the
new hash table. This requires that no CPUs be concurrently updating the
HPT, and so must be run under stop_machine().

This also adds a debugfs file which can be used to manually control
HPT resizing or testing purposes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Mackerras <paulus@samba.org>
[mpe: Rename the debugfs file to "hpt_order"]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10 13:27:55 +11:00
Benjamin Herrenschmidt
90c1e3c2fa powerpc/mm/radix: Update ERAT flushes when invalidating TLB
Three tiny changes to the ERAT flushing logic: First don't make
it depend on DD1. It hasn't been decided yet but we might run
DD2 in a mode that also requires explicit flushes for performance
reasons so make it unconditional. We also add a missing isync, and
finally remove the flush from _tlbiel_va as it is only necessary
for congruence-class invalidations (PID, LPID and full TLB), not
targetted invalidations.

Fixes: 96ed1fe511 ("powerpc/mm/radix: Invalidate ERAT on tlbiel for POWER9 DD1")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-09 14:54:33 +11:00
Benjamin Herrenschmidt
d7df2443cd powerpc/mm: Fix spurrious segfaults on radix with autonuma
When autonuma (Automatic NUMA balancing) marks a PTE inaccessible it
clears all the protection bits but leave the PTE valid.

With the Radix MMU, an attempt at executing from such a PTE will
take a fault with bit 35 of SRR1 set "SRR1_ISI_N_OR_G".

It is thus incorrect to treat all such faults as errors. We should
pass them to handle_mm_fault() for autonuma to deal with. The case
of pages that are really not executable is handled by the existing
test for VM_EXEC further down.

That leaves us with catching the kernel attempts at executing user
pages. We can catch that earlier, even before we do find_vma.

It is never valid on powerpc for the kernel to take an exec fault
to begin with. So fold that test with the existing test for the
kernel faulting on kernel addresses to bail out early.

Fixes: 1d18ad0268 ("powerpc/mm: Detect instruction fetch denied and report")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-08 23:36:29 +11:00
Paul Mackerras
16ed141677 powerpc/64: Make type of partition table flush depend on partition type
When changing a partition table entry on POWER9, we do a particular
form of the tlbie instruction which flushes all TLBs and caches of
the partition table for a given logical partition ID (LPID).
This instruction has a field in the instruction word, labelled R
(radix), which should be 1 if the partition was previously a radix
partition and 0 if it was a HPT partition.  This implements that
logic.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:46 +11:00
Paul Mackerras
ba9b399aee powerpc/64: Export pgtable_cache and pgtable_cache_add for KVM
This exports the pgtable_cache array and the pgtable_cache_add
function so that HV KVM can use them for allocating radix page
tables for guests.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:45 +11:00
Paul Mackerras
cc3d294013 powerpc/64: Enable use of radix MMU under hypervisor on POWER9
To use radix as a guest, we first need to tell the hypervisor via
the ibm,client-architecture call first that we support POWER9 and
architecture v3.00, and that we can do either radix or hash and
that we would like to choose later using an hcall (the
H_REGISTER_PROC_TBL hcall).

Then we need to check whether the hypervisor agreed to us using
radix.  We need to do this very early on in the kernel boot process
before any of the MMU initialization is done.  If the hypervisor
doesn't agree, we can't use radix and therefore clear the radix
MMU feature bit.

Later, when we have set up our process table, which points to the
radix tree for each process, we need to install that using the
H_REGISTER_PROC_TBL hcall.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:44 +11:00
Paul Mackerras
18569c1f13 powerpc/64: Don't try to use radix MMU under a hypervisor
Currently, if the kernel is running on a POWER9 processor under a
hypervisor, it will try to use the radix MMU even though it doesn't have
the necessary code to use radix under a hypervisor (it doesn't negotiate
use of radix, and it doesn't do the H_REGISTER_PROC_TBL hcall). The
result is that the guest kernel will crash when it tries to turn on the
MMU.

This fixes it by looking for the /chosen/ibm,architecture-vec-5
property, and if it exists, clears the radix MMU feature bit, before we
decide whether to initialize for radix or HPT. This property is created
by the hypervisor as a result of the guest calling the
ibm,client-architecture-support method to indicate its capabilities, so
it will indicate whether the hypervisor agreed to us using radix.

Systems without a hypervisor may have this property also (for example,
skiboot creates it), so we check the HV bit in the MSR to see whether we
are running as a guest or not. If we are in hypervisor mode, then we can
do whatever we like including using the radix MMU.

The reason for using this property is that in future, when we have
support for using radix under a hypervisor, we will need to check this
property to see whether the hypervisor agreed to us using radix.

Fixes: 2bfd65e45e ("powerpc/mm/radix: Add radix callbacks for early init routines")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:43 +11:00
Reza Arbab
0d0a4bc2a6 powerpc/mm: unstub radix__vmemmap_remove_mapping()
Use remove_pagetable() and friends for radix vmemmap removal.

We do not require the special-case handling of vmemmap done in the x86
versions of these functions. This is because vmemmap_free() has already
freed the mapped pages, and calls us with an aligned address range.

So, add a few failsafe WARNs, but otherwise the code to remove physical
mappings is already sufficient for vmemmap.

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 13:54:20 +11:00
Reza Arbab
4b5d62ca17 powerpc/mm: add radix__remove_section_mapping()
Tear down and free the four-level page tables of physical mappings
during memory hotremove.

Borrow the basic structure of remove_pagetable() and friends from the
identically-named x86 functions. Reduce the frequency of tlb flushes and
page_table_lock spinlocks by only doing them in the outermost function.
There was some question as to whether the locking is needed at all.
Leave it for now, but we could consider dropping it.

Memory must be offline to be removed, thus not in use. So there
shouldn't be the sort of concurrent page walking activity here that
might prompt us to use RCU.

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 13:54:19 +11:00
Reza Arbab
6cc27341b2 powerpc/mm: add radix__create_section_mapping()
Wire up memory hotplug page mapping for radix. Share the mapping
function already used by radix_init_pgtable().

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 13:54:19 +11:00
Reza Arbab
b5200ec9ed powerpc/mm: refactor radix physical page mapping
Move the page mapping code in radix_init_pgtable() into a separate
function that will also be used for memory hotplug.

The current goto loop progressively decreases its mapping size as it
covers the tail of a range whose end is unaligned. Change this to a for
loop which can do the same for both ends of the range.

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 13:54:18 +11:00
Alistair Popple
1d0761d255 powerpc/powernv: Initialise nest mmu
POWER9 contains an off core mmu called the nest mmu (NMMU). This is
used by other hardware units on the chip to translate virtual
addresses into real addresses. The unit attempting an address
translation provides the majority of the context required for the
translation request except for the base address of the partition table
(ie. the PTCR) which needs to be programmed into the NMMU.

This patch adds a call to OPAL to set the PTCR for the nest mmu in
opal_init().

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30 20:24:33 +11:00
Reza Arbab
2a8628d416 powerpc/mm: Allow memory hotplug into an offline node
Relax the check preventing us from hotplugging into an offline node.

This limitation was added in commit 482ec7c403 ("[PATCH] powerpc numa:
Support sparse online node map") to prevent adding resources to an
uninitialized node.

These days, there is no harm in doing so. The addition will actually
cause the node to be initialized and onlined; add_memory_resource()
calls hotadd_new_pgdat() (if necessary) and node_set_online().

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30 16:49:36 +11:00
Reza Arbab
7656cd8e8e powerpc/mm: Simplify loop control in parse_numa_properties()
The flow of the main loop in parse_numa_properties() is overly
complicated. Simplify it to be less confusing and easier to read.
No functional change.

The end of the main loop in parse_numa_properties() looks like this:

	for_each_node_by_type(...) {
		...
		if (!condition) {
			if (--ranges)
				goto new_range;
			else
				continue;
		}

		statement();

		if (--ranges)
			goto new_range;
		/* else
		 *	continue; <- implicit, this is the end of the loop
		 */
	}

The only effect of !condition is to skip execution of statement(). This
can be rewritten in a simpler way:

	for_each_node_by_type(...) {
		...
		if (condition)
			statement();

		if (--ranges)
			goto new_range;
	}

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30 16:49:30 +11:00
Reza Arbab
a0615a16f7 powerpc/mm: Use the correct pointer when setting a 2MB pte
When setting a 2MB pte, radix__map_kernel_page() is using the address

	ptep = (pte_t *)pudp;

Fix this conversion to use pmdp instead. Use pmdp_ptep() to do this
instead of casting the pointer.

Fixes: 2bfd65e45e ("powerpc/mm/radix: Add radix callbacks for early init routines")
Cc: stable@vger.kernel.org # v4.7+
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30 15:35:13 +11:00
Markus Elfring
a967f161ab powerpc/mm: Return directly after a failed __copy_from_user() in sys_subpage_prot()
This function already has multiple exit points, so there's no harm
adding another. Although it looks odd to return directly in a function
which takes a lock, we've actually just dropped the mmap_sem in this
code, so there's really no reason to go via a label. And it means we can
drop the unhelpfully named out2 label.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[mpe: Rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-25 13:34:22 +11:00
Aneesh Kumar K.V
9859f0c7e0 powerpc/mm: Remove the debug hugepd_ok check
We don't do this for other page table entries. So lets keep this simple
and always return false for hugepd check on a 64K page size config.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-23 19:19:28 +11:00
Aneesh Kumar K.V
20717e1ff5 powerpc/mm: Fix little-endian 4K hugetlb
When we switched to big endian page table, we never updated the hugepd
format such that it can work for both big endian and little endian
config. This patch series update hugepd format such that it is looked at
as __be64 value in big endian page table config.

This patch also switch hugepd_t.pd from signed long to unsigned long.
I did update the FSL hugepd_ok check to check for the top bit instead
of checking > 0.

Fixes: 5dc1ef858c ("powerpc/mm: Use big endian Linux page tables for book3s 64")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-18 11:58:50 +11:00
Aneesh Kumar K.V
ff8b85796d powerpc/mm/hugetlb: Don't panic when we don't find the default huge page size
The generic hugetlbfs code can handle not finding the default huge page
size correctly. With HPAGE_SHIFT = 0 we see in dmesg:

  hugetlbfs: disabling because there are no supported hugepage sizes

bash-4.2# echo 30 > /proc/sys/vm/nr_hugepages
bash: echo: write error: Operation not supported

Fixes: 03bb2d6590 ("powerpc: get hugetlbpage handling more generic")
Reported-by: Chris Smart <chris@distroguy.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-18 11:58:50 +11:00
Nicholas Piggin
bf5ca68dd2 powerpc: Fix pgtable pmd cache init
Commit 9b081e1080 ("powerpc: port 64 bits pgtable_cache to 32 bits")
mixed up PMD_INDEX_SIZE and PMD_CACHE_INDEX a couple of times. This
resulted in 64s/hash/4k configs to panic at boot with a false positive
error check.

Fix that and simplify error handling by moving the check to the caller.

Fixes: 9b081e1080 ("powerpc: port 64 bits pgtable_cache to 32 bits")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-18 11:58:30 +11:00
Reza Arbab
32b53c012e powerpc/mm: Fix memory hotplug BUG() on radix
Memory hotplug is leading to hash page table calls, even on radix:

  arch_add_memory
    create_section_mapping
      htab_bolt_mapping
        BUG_ON(!ppc_md.hpte_insert);

To fix, refactor {create,remove}_section_mapping() into hash__ and
radix__ variants. Leave the radix versions stubbed for now.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-17 10:05:43 +11:00
Linus Torvalds
b272f732f8 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug notifier removal from Thomas Gleixner:
 "This is the final cleanup of the hotplug notifier infrastructure. The
  series has been reintgrated in the last two days because there came a
  new driver using the old infrastructure via the SCSI tree.

  Summary:

   - convert the last leftover drivers utilizing notifiers

   - fixup for a completely broken hotplug user

   - prevent setup of already used states

   - removal of the notifiers

   - treewide cleanup of hotplug state names

   - consolidation of state space

  There is a sphinx based documentation pending, but that needs review
  from the documentation folks"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/armada-xp: Consolidate hotplug state space
  irqchip/gic: Consolidate hotplug state space
  coresight/etm3/4x: Consolidate hotplug state space
  cpu/hotplug: Cleanup state names
  cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
  staging/lustre/libcfs: Convert to hotplug state machine
  scsi/bnx2i: Convert to hotplug state machine
  scsi/bnx2fc: Convert to hotplug state machine
  cpu/hotplug: Prevent overwriting of callbacks
  x86/msr: Remove bogus cleanup from the error path
  bus: arm-ccn: Prevent hotplug callback leak
  perf/x86/intel/cstate: Prevent hotplug callback leak
  ARM/imx/mmcd: Fix broken cpu hotplug handling
  scsi: qedi: Convert to hotplug state machine
2016-12-25 14:05:56 -08:00
Thomas Gleixner
73c1b41e63 cpu/hotplug: Cleanup state names
When the state names got added a script was used to add the extra argument
to the calls. The script basically converted the state constant to a
string, but the cleanup to convert these strings into meaningful ones did
not happen.

Replace all the useless strings with 'subsys/xxx/yyy:state' strings which
are used in all the other places already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-25 10:47:44 +01:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds
de399813b5 powerpc updates for 4.10
Highlights include:
 
  - Support for the kexec_file_load() syscall, which is a prereq for secure and
    trusted boot.
 
  - Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN).
 
  - Sort the exception tables at build time, to save time at boot, and store
    them as relative offsets to save space in the kernel image & memory.
 
  - Allow building the kernel with thin archives, which should allow us to build
    an allyesconfig once some other fixes land.
 
  - Build fixes to allow us to correctly rebuild when changing the kernel endian
    from big to little or vice versa.
 
  - Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix.
 
  - Initial stack protector support (-fstack-protector).
 
  - Support for dumping the radix (aka. Linux) and hash page tables via debugfs.
 
  - Fix an oops in cxl coredump generation when cxl_get_fd() is used.
 
  - Freescale updates from Scott: "Highlights include 8xx hugepage support,
    qbman fixes/cleanup, device tree updates, and some misc cleanup."
 
  - Many and varied fixes and minor enhancements as always.
 
 Thanks to:
   Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual,
   Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet,
   Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat,
   Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold,
   Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan
   Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin,
   Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
   Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYU4YSAAoJEFHr6jzI4aWAC4gQALtIAqqPon0Cd5b/FVVcMbW7
 mMqB2b/0FGEl5GoRTzGUDaQqElilm6AEVfHO86C7DFji/a6olneFfw87iz+mtWuZ
 JvrNq68ZiSnoeszdUy4MgtXFLb5sTzNMev4skaHfjI9E5CepWBoR0zH4G+kNVnd5
 WSgudv8Cq4Px+MEuTOigt3QYjHzZ3cw/XNOOm9c+oGj+PDW4O9UItVI+S1WLoey4
 rAB2nRcLMDPuwfRQC9XsF3zEbkv4h1dEXo/EBRuRpcF+0lLTzFw1lv1WE8OxlUmS
 kAXbty3dIytBfSbtJT0c0Ps6sfQ4HFhu6ZV2fjnxNTz2KDkBIN7LBYHmBYiqY9oZ
 9zvbUWtfiTu5ocfRtTq7rC/Hcj4Kbr9S9F/FvXR0WyDsKgu4xxAovqC3gcn6YjYK
 Rr1tcCI4nUzyhVJVmd+OEhUvc5JbFy9aGage+YeOyejfvvSbXIunaxWlPjoDkvim
 Vjl+UKU8gw51XFssqY5ZBi/HNlMFKYedLpMFp/fItnLglhj50V0eFWkpDgdSCYom
 vo9ifPLZx8n8m8De3H7TV4E0F4gCHcTeqZdu7tW9AAUVM6iLJcDLm3asGmtNh21t
 snOHNOJ5QSIno6ezUUg29T6VBjbPh46fdJJSlIZrEe8OzLZ1haGyttf0tD00PQvY
 Z2W/m3gxafnOeGgBqvyv
 =xOzf
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for the kexec_file_load() syscall, which is a prereq for
     secure and trusted boot.

   - Prevent kernel execution of userspace on P9 Radix (similar to
     SMEP/PXN).

   - Sort the exception tables at build time, to save time at boot, and
     store them as relative offsets to save space in the kernel image &
     memory.

   - Allow building the kernel with thin archives, which should allow us
     to build an allyesconfig once some other fixes land.

   - Build fixes to allow us to correctly rebuild when changing the
     kernel endian from big to little or vice versa.

   - Plumbing so that we can avoid doing a full mm TLB flush on P9
     Radix.

   - Initial stack protector support (-fstack-protector).

   - Support for dumping the radix (aka. Linux) and hash page tables via
     debugfs.

   - Fix an oops in cxl coredump generation when cxl_get_fd() is used.

   - Freescale updates from Scott: "Highlights include 8xx hugepage
     support, qbman fixes/cleanup, device tree updates, and some misc
     cleanup."

   - Many and varied fixes and minor enhancements as always.

  Thanks to:
    Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
    Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
    Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
    Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
    Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
    Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
    Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
    Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
    Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"

[ And thanks to Michael, who took time off from a new baby to get this
  pull request done.   - Linus ]

* tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
  powerpc/fsl/dts: add FMan node for t1042d4rdb
  powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
  powerpc/fsl/dts: add QMan and BMan nodes on t1024
  powerpc/fsl/dts: add QMan and BMan nodes on t1023
  soc/fsl/qman: test: use DEFINE_SPINLOCK()
  powerpc/fsl-lbc: use DEFINE_SPINLOCK()
  powerpc/8xx: Implement support of hugepages
  powerpc: get hugetlbpage handling more generic
  powerpc: port 64 bits pgtable_cache to 32 bits
  powerpc/boot: Request no dynamic linker for boot wrapper
  soc/fsl/bman: Use resource_size instead of computation
  soc/fsl/qe: use builtin_platform_driver
  powerpc/fsl_pmc: use builtin_platform_driver
  powerpc/83xx/suspend: use builtin_platform_driver
  powerpc/ftrace: Fix the comments for ftrace_modify_code
  powerpc/perf: macros for power9 format encoding
  powerpc/perf: power9 raw event format encoding
  powerpc/perf: update attribute_group data structure
  powerpc/perf: factor out the event format field
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  ...
2016-12-16 09:26:42 -08:00
Michael Ellerman
c6f6634721 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next
Freescale updates from Scott:

"Highlights include 8xx hugepage support, qbman fixes/cleanup, device
tree updates, and some misc cleanup."
2016-12-16 15:05:38 +11:00
Linus Torvalds
93173b5bf2 Small release, the most interesting stuff is x86 nested virt improvements.
x86: userspace can now hide nested VMX features from guests; nested
 VMX can now run Hyper-V in a guest; support for AVX512_4VNNIW and
 AVX512_FMAPS in KVM; infrastructure support for virtual Intel GPUs.
 
 PPC: support for KVM guests on POWER9; improved support for interrupt
 polling; optimizations and cleanups.
 
 s390: two small optimizations, more stuff is in flight and will be
 in 4.11.
 
 ARM: support for the GICv3 ITS on 32bit platforms.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQExBAABCAAbBQJYTkP0FBxwYm9uemluaUByZWRoYXQuY29tAAoJEL/70l94x66D
 lZIH/iT1n9OQXcuTpYYnQhuCenzI3GZZOIMTbCvK2i5bo0FIJKxVn0EiAAqZSXvO
 nO185FqjOgLuJ1AD1kJuxzye5suuQp4HIPWWgNHcexLuy43WXWKZe0IQlJ4zM2Xf
 u31HakpFmVDD+Cd1qN3yDXtDrRQ79/xQn2kw7CWb8olp+pVqwbceN3IVie9QYU+3
 gCz0qU6As0aQIwq2PyalOe03sO10PZlm4XhsoXgWPG7P18BMRhNLTDqhLhu7A/ry
 qElVMANT7LSNLzlwNdpzdK8rVuKxETwjlc1UP8vSuhrwad4zM2JJ1Exk26nC2NaG
 D0j4tRSyGFIdx6lukZm7HmiSHZ0=
 =mkoB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Small release, the most interesting stuff is x86 nested virt
  improvements.

  x86:
   - userspace can now hide nested VMX features from guests
   - nested VMX can now run Hyper-V in a guest
   - support for AVX512_4VNNIW and AVX512_FMAPS in KVM
   - infrastructure support for virtual Intel GPUs.

  PPC:
   - support for KVM guests on POWER9
   - improved support for interrupt polling
   - optimizations and cleanups.

  s390:
   - two small optimizations, more stuff is in flight and will be in
     4.11.

  ARM:
   - support for the GICv3 ITS on 32bit platforms"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
  arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest
  KVM: arm/arm64: timer: Check for properly initialized timer on init
  KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs
  KVM: x86: Handle the kthread worker using the new API
  KVM: nVMX: invvpid handling improvements
  KVM: nVMX: check host CR3 on vmentry and vmexit
  KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
  KVM: nVMX: propagate errors from prepare_vmcs02
  KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
  KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
  KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
  KVM: nVMX: support restore of VMX capability MSRs
  KVM: nVMX: generate non-true VMX MSRs based on true versions
  KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
  KVM: x86: Add kvm_skip_emulated_instruction and use it.
  KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
  KVM: VMX: Reorder some skip_emulated_instruction calls
  KVM: x86: Add a return value to kvm_emulate_cpuid
  KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
  ...
2016-12-13 15:47:02 -08:00
Reza Arbab
4a3bac4e3a powerpc/mm: allow memory hotplug into a memoryless node
Patch series "enable movable nodes on non-x86 configs", v7.

This patchset allows more configs to make use of movable nodes.  When
CONFIG_MOVABLE_NODE is selected, there are two ways to introduce such
nodes into the system:

1. Discover movable nodes at boot. Currently this is only possible on
   x86, but we will enable configs supporting fdt to do the same.

2. Hotplug and online all of a node's memory using online_movable. This
   is already possible on any config supporting memory hotplug, not
   just x86, but the Kconfig doesn't say so. We will fix that.

We'll also remove some cruft on power which would prevent (2).

This patch (of 5):

Remove the check which prevents us from hotplugging into an empty node.

The original commit b226e46212 ("[PATCH] powerpc: don't add memory to
empty node/zone"), states that this was intended to be a temporary measure.
It is a workaround for an oops which no longer occurs.

Link: http://lkml.kernel.org/r/1479160961-25840-2-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alistair Popple <apopple@au1.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Christophe Leroy
4b91428699 powerpc/8xx: Implement support of hugepages
8xx uses a two level page table with two different linux page size
support (4k and 16k). 8xx also support two different hugepage sizes
512k and 8M. In order to support them on linux we define two different
page table layout.

The size of pages is in the PGD entry, using PS field (bits 28-29):
00 : Small pages (4k or 16k)
01 : 512k pages
10 : reserved
11 : 8M pages

For 512K hugepage size a pgd entry have the below format
[<hugepte address >0101] . The hugepte table allocated will contain 8
entries pointing to 512K huge pte in 4k pages mode and 64 entries in
16k pages mode.

For 8M in 16k mode, a pgd entry have the below format
[<hugepte address >1101] . The hugepte table allocated will contain 8
entries pointing to 8M huge pte.

For 8M in 4k mode, multiple pgd entries point to the same hugepte
address and pgd entry will have the below format
[<hugepte address>1101]. The hugepte table allocated will only have one
entry.

For the time being, we do not support CPU15 ERRATA when HUGETLB is
selected

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> (v3, for the generic bits)
Signed-off-by: Scott Wood <oss@buserror.net>
2016-12-09 22:49:07 -06:00
Christophe Leroy
03bb2d6590 powerpc: get hugetlbpage handling more generic
Today there are two implementations of hugetlbpages which are managed
by exclusive #ifdefs:
* FSL_BOOKE: several directory entries points to the same single hugepage
* BOOK3S: one upper level directory entry points to a table of hugepages

In preparation of implementation of hugepage support on the 8xx, we
need a mix of the two above solutions, because the 8xx needs both cases
depending on the size of pages:
* In 4k page size mode, each PGD entry covers a 4M bytes area. It means
that 2 PGD entries will be necessary to cover an 8M hugepage while a
single PGD entry will cover 8x 512k hugepages.
* In 16 page size mode, each PGD entry covers a 64M bytes area. It means
that 8x 8M hugepages will be covered by one PGD entry and 64x 512k
hugepages will be covers by one PGD entry.

This patch:
* removes #ifdefs in favor of if/else based on the range sizes
* merges the two huge_pte_alloc() functions as they are pretty similar
* merges the two hugetlbpage_init() functions as they are pretty similar

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> (v3)
Signed-off-by: Scott Wood <oss@buserror.net>
2016-12-09 22:48:09 -06:00
Christophe Leroy
9b081e1080 powerpc: port 64 bits pgtable_cache to 32 bits
Today powerpc64 uses a set of pgtable_caches while powerpc32 uses
standard pages when using 4k pages and a single pgtable_cache
if using other size pages.

In preparation of implementing huge pages on the 8xx, this patch
replaces the specific powerpc32 handling by the 64 bits approach.

This is done by:
* moving 64 bits pgtable_cache_add() and pgtable_cache_init()
in a new file called init-common.c
* modifying pgtable_cache_init() to also handle the case
without PMD
* removing the 32 bits version of pgtable_cache_add() and
pgtable_cache_init()
* copying related header contents from 64 bits into both the
book3s/32 and nohash/32 header files

On the 8xx, the following cache sizes will be used:
* 4k pages mode:
- PGT_CACHE(10) for PGD
- PGT_CACHE(3) for 512k hugepage tables
* 16k pages mode:
- PGT_CACHE(6) for PGD
- PGT_CACHE(7) for 512k hugepage tables
- PGT_CACHE(3) for 8M hugepage tables

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-12-09 22:48:01 -06:00
Alexey Kardashevskiy
4b6fad7097 powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().

This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.

This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:34 +11:00
Alexey Kardashevskiy
d7baee6901 powerpc/iommu: Stop using @current in mm_iommu_xxx
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.

This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.

This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:29 +11:00
Alexey Kardashevskiy
88f54a3581 powerpc/iommu: Pass mm_struct to init/cleanup helpers
We are going to get rid of @current references in mmu_context_boos3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct which does have
the reference counter.

This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:27 +11:00
Michael Ellerman
dd5ac03e09 powerpc/mm: Fix page table dump build on non-Book3S
In the recent commit 1515ab9321 ("powerpc/mm: Dump hash table") we
added code to dump the hage page table. Currently this can be selected
to build on any platform. However it breaks the build if we're building
for a non-Book3S platform, because none of the hash page table related
defines and so on exist. So restrict it to building only on Book3S.

Similarly in commit 8eb07b1870 ("powerpc/mm: Dump linux pagetables")
we added code to dump the Linux page tables, which uses some constants
which are only defined on Book3S - so guard those with an #ifdef.

Fixes: 1515ab9321 ("powerpc/mm: Dump hash table")
Fixes: 8eb07b1870 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-01 16:20:18 +11:00
Balbir Singh
0ab5171b89 powerpc/mm: Fix no execute fault handling on pre-POWER5
Aneesh/Ben reported that the change to do_page_fault() we made in commit
1d18ad0268 ("powerpc/mm: Detect instruction fetch denied and report")
needs to handle the case where CPU_FTR_COHERENT_ICACHE is missing but we
have CPU_FTR_NOEXECUTE. In those cases the check added for
SRR1_ISI_N_OR_G might trigger a false positive.

This patch adds a check for CPU_FTR_COHERENT_ICACHE in addition to the
MSR value.

Fixes: 1d18ad0268 ("powerpc/mm: Detect instruction fetch denied and report")
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-30 17:19:01 +11:00
Benjamin Herrenschmidt
dd7b2f035e powerpc/mm: Fix lazy icache flush on pre-POWER5
On 64-bit CPUs with no-execute support and non-snooping icache, such as
970 or POWER4, we have a software mechanism to ensure coherency of the
cache (using exec faults when needed).

This was broken due to a logic error when the code was rewritten
from assembly to C, previously the assembly code did:

  BEGIN_FTR_SECTION
         mr      r4,r30
         mr      r5,r7
         bl      hash_page_do_lazy_icache
  END_FTR_SECTION(CPU_FTR_NOEXECUTE|CPU_FTR_COHERENT_ICACHE, CPU_FTR_NOEXECUTE)

Which tests that:
   (cpu_features & (NOEXECUTE | COHERENT_ICACHE)) == NOEXECUTE

Which says that the current cpu does have NOEXECUTE, but does not have
COHERENT_ICACHE.

Fixes: 91f1da9979 ("powerpc/mm: Convert 4k hash insert to C")
Fixes: 89ff725051 ("powerpc/mm: Convert __hash_page_64K to C")
Fixes: a43c0eb836 ("powerpc/mm: Convert 4k insert from asm to C")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Change log verbosification]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-29 23:59:40 +11:00
Aneesh Kumar K.V
b3603e174f powerpc/mm: update radix__ptep_set_access_flag to not do full mm tlb flush
When we are updating a pte, we just need to flush the tlb mapping
that pte. Right now we do a full mm flush because we don't track the page
size. Now that we have page size details in pte use that to do the
optimized flush

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-28 22:44:33 +11:00
Aneesh Kumar K.V
6d3a0379eb powerpc/mm: Add radix__tlb_flush_pte_p9_dd1()
Now that we have page size details encoded in pte using software pte
bits, use that to find the page size needed for tlb flush.

This function should only be used on P9 DD1, so give it a horrible name
to make that clear.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-28 22:43:45 +11:00
Balbir Singh
3b10d0095a powerpc/mm/radix: Prevent kernel execution of user space
ISA 3 defines new encoded access authority that allows instruction
access prevention in privileged mode and allows normal access
to problem state. This patch just enables IAMR (Instruction Authority
Mask Register), enabling AMR would require more work.

I've tested this with a buggy driver and a simple payload. The payload
is specific to the build I've tested.

mpe: Also tested with LKDTM:

  # echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
  lkdtm: Performing direct entry EXEC_USERSPACE
  lkdtm: attempting ok execution at c0000000005bf560
  lkdtm: attempting bad execution at 00003fff8d940000
  Unable to handle kernel paging request for instruction fetch
  Faulting instruction address: 0x3fff8d940000
  Oops: Kernel access of bad area, sig: 11 [#1]
  NIP: 00003fff8d940000 LR: c0000000005bfa58 CTR: 00003fff8d940000
  REGS: c0000000f1fcf900 TRAP: 0400   Not tainted  (4.9.0-rc5-compiler_gcc-6.2.0-00109-g956dbc06232a)
  MSR: 9000000010009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48002222  XER: 00000000
  ...
  Call Trace:
    lkdtm_EXEC_USERSPACE+0x104/0x120 (unreliable)
    lkdtm_do_action+0x3c/0x80
    direct_entry+0x100/0x1b0
    full_proxy_write+0x94/0x100
    __vfs_write+0x3c/0x1b0
    vfs_write+0xcc/0x230
    SyS_write+0x60/0x110
    system_call+0x38/0xfc

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-26 18:48:04 +11:00
Balbir Singh
1d18ad0268 powerpc/mm: Detect instruction fetch denied and report
ISA 3 allows for prevention of instruction fetch and execution
of user mode pages. If such an error occurs, SRR1 bit 35 reports the
error. We catch and report the error in do_page_fault().

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 15:01:35 +11:00
Balbir Singh
ee97b6b99f powerpc/mm/radix: Setup AMOR in HV mode to allow key 0
Setup AMOR (Authority Mask Override Register) in HV mode so that the
host and guest kernel can in turn setup IAMR.

This allows us to enable key 0 in a following patch.

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 15:01:31 +11:00
Aneesh Kumar K.V
984d7a1ec6 powerpc/mm: Fixup kernel read only mapping
With commit e58e87adc8 ("powerpc/mm: Update _PAGE_KERNEL_RO") we
started using the ppp value 0b110 to map kernel readonly. But that
facility was only added as part of ISA 2.04. For earlier ISA version
only supported ppp bit value for readonly mapping is 0b011. (This
implies both user and kernel get mapped using the same ppp bit value for
readonly mapping.).
Update the code such that for earlier architecture version we use ppp
value 0b011 for readonly mapping. We don't differentiate between power5+
and power5 here and apply the new ppp bits only from power6 (ISA 2.05).
This keep the changes minimal.

This fixes issue with PS3 spu usage reported at
https://lkml.kernel.org/r/rep.1421449714.geoff@infradead.org

Fixes: e58e87adc8 ("powerpc/mm: Update _PAGE_KERNEL_RO")
Cc: stable@vger.kernel.org # v4.7+
Tested-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 14:18:25 +11:00
Michael Ellerman
ddbefe7e77 Merge branch 'topic/ppc-kvm' into next
Merge the topic branch we're sharing with the kvm-ppc tree.
2016-11-24 22:14:52 +11:00
Paul Mackerras
9d66195807 powerpc/64: Provide functions for accessing POWER9 partition table
POWER9 requires the host to set up a partition table, which is a
table in memory indexed by logical partition ID (LPID) which
contains the pointers to page tables and process tables for the
host and each guest.

This factors out the initialization of the partition table into
a single function.  This code was previously duplicated between
hash_utils_64.c and pgtable-radix.c.

This provides a function for setting a partition table entry,
which is used in early MMU initialization, and will be used by
KVM whenever a guest is created.  This function includes a tlbie
instruction which will flush all TLB entries for the LPID and
all caches of the partition table entry for the LPID, across the
system.

This also moves a call to memblock_set_current_limit(), which was
in radix_init_partition_table(), but has nothing to do with the
partition table.  By analogy with the similar code for hash, the
call gets moved to near the end of radix__early_init_mmu().  It
now gets called when running as a guest, whereas previously it
would only be called if the kernel is running as the host.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-23 10:32:11 +11:00
Aneesh Kumar K.V
64168f4296 powerpc/mm/coproc: Handle bad address on coproc slb fault
VSID 0 is bad address. Don't create slb entries on coproc fault for
bad address

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-22 11:57:08 +11:00
Aneesh Kumar K.V
cac4a18540 powerpc/mm: Fix missing update of HID register on secondary CPUs
We need to update on secondaries for the selected MMU mode.

Fixes: ad410674f5 ("powerpc/mm: Update the HID bit when switching from radix to hash")
Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-18 23:16:58 +11:00
Michael Neuling
96ed1fe511 powerpc/mm/radix: Invalidate ERAT on tlbiel for POWER9 DD1
On POWER9 DD1, when we do a local TLB invalidate we also need to explicitly
invalidate the ERAT.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-18 15:12:24 +11:00
Suraj Jitindar Singh
555c16328a powerpc/mm: Correct process and partition table max size
Version 3.00 of the ISA states that the PATS (partition table size) field
of the PTCR (partition table control register) and the PRTS (process table
size) field of the partition table entry must both be less than or equal
to 24. However the actual size of the partition and process tables is equal
to 2 to the power of 12 plus the PATS and PRTS fields, respectively. This
means that the max allowable size of each of these tables is 2^36 or 64GB
for both.

Thus when checking the size shift for each we should be checking for values
of greater than 36 instead of the current check for shifts larger than 24
and 23.

Fixes: 2bfd65e45e
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-17 17:11:53 +11:00
Rashmica Gupta
1515ab9321 powerpc/mm: Dump hash table
Useful to be able to dump the kernel hash page table to check
which pages are hashed along with their sizes and other details.

Add a debugfs file to check the hash page table. If radix is enabled
(and so there is no hash page table) then this file doesn't exist. To
use this the PPC_PTDUMP config option must be selected.

Signed-off-by: Rashmica Gupta <rashmicy@gmail.com>
[mpe: Fix build with SPARSEMEM_VMEMMAP=n & PSERIES=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-17 17:11:47 +11:00
Rashmica Gupta
8eb07b1870 powerpc/mm: Dump linux pagetables
Useful to be able to dump the kernels page tables to check permissions
and memory types - derived from arm64's implementation.

Add a debugfs file to check the page tables. To use this the PPC_PTDUMP
config option must be selected.

Signed-off-by: Rashmica Gupta <rashmicy@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-17 17:11:46 +11:00
Balbir Singh
ac8d3818aa powerpc/mm: Fix typo in radix encodings print
Rename "sift" to "shift".

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-17 17:11:45 +11:00
Paul Mackerras
6b243fcfb5 powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format
This changes the way that we support the new ISA v3.00 HPTE format.
Instead of adapting everything that uses HPTE values to handle either
the old format or the new format, depending on which CPU we are on,
we now convert explicitly between old and new formats if necessary
in the low-level routines that actually access HPTEs in memory.
This limits the amount of code that needs to know about the new
format and makes the conversions explicit.  This is OK because the
old format contains all the information that is in the new format.

This also fixes operation under a hypervisor, because the H_ENTER
hypercall (and other hypercalls that deal with HPTEs) will continue
to require the HPTE value to be supplied in the old format.  At
present the kernel will not boot in HPT mode on POWER9 under a
hypervisor.

This fixes and partially reverts commit 50de596de8
("powerpc/mm/hash: Add support for Power9 Hash", 2016-04-29).

Fixes: 50de596de8 ("powerpc/mm/hash: Add support for Power9 Hash")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-16 15:43:25 +11:00
Balbir Singh
f923efbcfd powerpc/hash64: Be more careful when generating tlbiel
In ISA v2.05, the tlbiel instruction takes two arguments, RB and L:

tlbiel RB,L

+---------+---------+----+---------+---------+---------+----+
|    31   |    /    | L  |    /    |    RB   |   274   | /  |
| 31 - 26 | 25 - 22 | 21 | 20 - 16 | 15 - 11 |  10 - 1 | 0  |
+---------+---------+----+---------+---------+---------+----+

In ISA v2.06 tlbiel takes only one argument, RB:

tlbiel RB

+---------+---------+---------+---------+---------+----+
|    31   |    /    |    /    |    RB   |   274   | /  |
| 31 - 26 | 25 - 21 | 20 - 16 | 15 - 11 |  10 - 1 | 0  |
+---------+---------+---------+---------+---------+----+

And in ISA v3.00 tlbiel takes five arguments:

tlbiel RB,RS,RIC,PRS,R

+---------+---------+----+---------+----+----+---------+---------+----+
|    31   |    RS   | /  |   RIC   |PRS | R  |    RB   |   274   | /  |
| 31 - 26 | 25 - 21 | 20 | 19 - 18 | 17 | 16 | 15 - 11 |  10 - 1 | 0  |
+---------+---------+----+---------+----+----+---------+---------+----+

However the assembler also accepts "tlbiel RB", and generates
"tlbiel RB,r0,0,0,0".

As you can see above the L field from the v2.05 encoding overlaps with the
reserved field of the v2.06 encoding, and the low bit of the RS field of the
v3.00 encoding.

Currently in __tlbiel() we generate two tlbiel instructions manually using hex
constants. In the first case, for MMU_PAGE_4K, we generate "tlbiel RB,0", which
is safe in all cases, because the L bit is zero.

However in the default case we generate "tlbiel RB,1", therefore setting bit 21
to 1.

This is not an actual bug on v2.06 processors, because the CPU ignores the value
of the reserved field. However software is supposed to encode the reserved
fields as zero to enable forward compatibility.

On v3.00 processors setting bit 21 to 1 and no other bits of RS, means we are
using r1 for the value of RS.

Although it's not obvious, the code sets the IS field (bits 10-11) to 0 (by
omission), and L=1, in the va value, which is passed as RB. We also pass R=0 in
the instruction.

The combination of IS=0, L=1 and R=0 means the value of RS is not used, so even
on ISA v3.00 there is no actual bug.

We should still fix it, as setting a reserved bit on v2.06 is naughty, and we
are only avoiding a bug on v3.00 by accident rather than design. Use
ASM_FTR_IFSET() to generate the single argument form on ISA v2.06 and later, and
the two argument form on pre v2.06.

Although there may be very old toolchains which don't understand tlbiel, we have
other code in the tree which has been using tlbiel for over five years, and no
one has reported any build failures, so just let the assembler generate the
instructions.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Rewrite change log, use IFSET instead of IFCLR]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-14 11:11:51 +11:00
Nicholas Piggin
61a92f7031 powerpc: Add support for relative exception tables
This halves the exception table size on 64-bit builds, and it allows
build-time sorting of exception tables to work on relocated kernels.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Minor asm fixups and bits to keep the selftests working]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-14 11:11:51 +11:00
Aneesh Kumar K.V
bd77c44986 powerpc/mm/radix: Use tlbiel only if we ever ran on the current cpu
Before this patch, we used tlbiel, if we ever ran only on this core.
That was mostly derived from the nohash usage of the same. But is
incorrect, the ISA 3.0 clarifies tlbiel such that:

"All TLB entries that have all of the following properties are made
invalid on the thread executing the tlbiel instruction"

ie. tlbiel only invalidates TLB entries on the current thread. So if the
mm has been used on any other thread (aka. cpu) then we must broadcast
the invalidate.

This bug could lead to invalid TLB entries if a program runs on multiple
threads of a core.

Hence use tlbiel, if we only ever ran on only the current cpu.

Fixes: 1a472c9dba ("powerpc/mm/radix: Add tlbflush routines")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-27 21:55:13 +11:00
Aneesh Kumar K.V
8467801cc8 powerpc: Fix numa topology console print
With recent update to printk, we get console output like below:

[    0.550639] Brought up 160 CPUs
[    0.550718] Node 0 CPUs:
[    0.550721]  0
[    0.550754] -39

[    0.550794] Node 1 CPUs:
[    0.550798]  40
[    0.550817] -79

[    0.550856] Node 16 CPUs:
[    0.550860]  80
[    0.550880] -119

[    0.550917] Node 17 CPUs:
[    0.550923]  120
[    0.550942] -159

Fix this by properly using pr_cont(), ie. KERN_CONT.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-19 20:35:41 +11:00
Michael Ellerman
08b5e79ebd powerpc/mm: Drop dump_numa_memory_topology()
At boot we dump the NUMA memory topology in dump_numa_memory_topology(),
at KERN_DEBUG level, resulting in output like:

  Node 0 Memory: 0x0-0x100000000
  Node 1 Memory: 0x100000000-0x200000000

Which is nice enough, but immediately after that we iterate over each
node and call setup_node_data(), which also prints out the node ranges,
at KERN_INFO, giving eg:

  numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
  numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]

Additionally dump_numa_memory_topology() does not use KERN_CONT
correctly, resulting in split output lines on recent kernels.

So drop dump_numa_memory_topology() as superfluous chatter.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-19 20:35:40 +11:00
Frederic Barrat
d2cf909cda powerpc/mm: Prevent unlikely crash in copro_calculate_slb()
If a cxl adapter faults on an invalid address for a kernel context, we
may enter copro_calculate_slb() with a NULL mm pointer (kernel
context) and an effective address which looks like a user
address. Which will cause a crash when dereferencing mm. It is clearly
an AFU bug, but there's no reason to crash either. So return an error,
so that cxl can ack the interrupt with an address error.

Fixes: 73d16a6e0e ("powerpc/cell: Move data segment faulting code out of cell platform")
Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-19 20:32:49 +11:00
Linus Torvalds
84d69848c9 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild updates from Michal Marek:

 - EXPORT_SYMBOL for asm source by Al Viro.

   This does bring a regression, because genksyms no longer generates
   checksums for these symbols (CONFIG_MODVERSIONS). Nick Piggin is
   working on a patch to fix this.

   Plus, we are talking about functions like strcpy(), which rarely
   change prototypes.

 - Fixes for PPC fallout of the above by Stephen Rothwell and Nick
   Piggin

 - fixdep speedup by Alexey Dobriyan.

 - preparatory work by Nick Piggin to allow architectures to build with
   -ffunction-sections, -fdata-sections and --gc-sections

 - CONFIG_THIN_ARCHIVES support by Stephen Rothwell

 - fix for filenames with colons in the initramfs source by me.

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (22 commits)
  initramfs: Escape colons in depfile
  ppc: there is no clear_pages to export
  powerpc/64: whitelist unresolved modversions CRCs
  kbuild: -ffunction-sections fix for archs with conflicting sections
  kbuild: add arch specific post-link Makefile
  kbuild: allow archs to select link dead code/data elimination
  kbuild: allow architectures to use thin archives instead of ld -r
  kbuild: Regenerate genksyms lexer
  kbuild: genksyms fix for typeof handling
  fixdep: faster CONFIG_ search
  ia64: move exports to definitions
  sparc32: debride memcpy.S a bit
  [sparc] unify 32bit and 64bit string.h
  sparc: move exports to definitions
  ppc: move exports to definitions
  arm: move exports to definitions
  s390: move exports to definitions
  m68k: move exports to definitions
  alpha: move exports to actual definitions
  x86: move exports to actual definitions
  ...
2016-10-14 14:26:58 -07:00
Linus Torvalds
d8bfb96a2e powerpc updates for 4.9 #2
Freescale updates from Scott:
 
 "Highlights include qbman support (a prerequisite for datapath drivers
 such as ethernet), a PCI DMA fix+improvement, reset handler changes, more
 8xx optimizations, and some cleanups and fixes."
 
 Fixes:
  - selftests/powerpc: Add missing binaries to .gitignores (Michael Ellerman)
  - selftests/powerpc: Fix build break caused by EXPORT_SYMBOL changes (Michael Ellerman)
  - powerpc/pseries: Fix stack corruption in htpe code (Laurent Dufour)
  - powerpc/64s: Fix power4_fixup_nap placement (Nicholas Piggin)
  - powerpc/64: Fix incorrect return value from __copy_tofrom_user (Paul Mackerras)
  - powerpc/mm/hash64: Fix might_have_hea() check (Michael Ellerman)
 
 Other:
  - MAINTAINERS: Remove myself from PA Semi entries (Olof Johansson)
  - MAINTAINERS: Drop separate pseries entry (Michael Ellerman)
  - MAINTAINERS: Update powerpc website & add selftests (Michael Ellerman)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX/1EOAAoJEFHr6jzI4aWAu0UQAIsmnc361a4xOl1ODRzJNWSD
 OqBbuGQPZ3I3XPMxB6BBK4mnpR507nb+L1yxMDidDebJal/pRJdXkGi7I5pe9uq+
 12XJcaePtVfmrHKKWAhC/fef0gQHSusBDpIIDquN5QE1BVvUDGbynG4GjnpX9ZaT
 gmXGL03u/yJvUoUNexG7lrMAJ7bZgU8BzFKyojzWtoEDF4SM7rpWKs1hGwojW4/T
 EYcek5uTNo01UsN/WNrtBkHA8eC9unnLk9NisOxvBXu7eJfEq38Bz71fhoowFO+C
 FDRboPdkXxySzzNTBb3hROontLZS2S13upzjcrRo2/f4gxvcimRJtDzxuRKrYX5n
 xdXcZVdFSRsKanbuV0Dwjki05IU4zeOhsHUqYqaS2UD+QlAbNCu0N9DZOhPMn+H2
 8uT3cOOrBLBrhIH3e7DMK9Rx97FBeuCvwrbjnZp8My7s55VXXd2CZTFYf5/wW0b2
 VEf5eoXM1BB2zuh9kFZ785Sq5iYnsKoNhKjoXULkBrf3m7WtmjPIbHzRTJM5ltwt
 YUvFMG6nncQB0ERVOvDIXXNzwVB0JkJTVX2BBZ2a7Fr+8KHE6rTYkgcQiosibUmq
 gLV9M59MFamAgJlna3A1OmGIpEiZ3RrriYL2mgraWwuLUn/qW3yPiPE6hBk0uomL
 cARvlIjGf8rWhi+3qAb1
 =lwsd
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull more powerpc updates from Michael Ellerman:
 "Some more powerpc updates for 4.9:

  Freescale updates from Scott Wood:
   - qbman support (a prerequisite for datapath drivers such as ethernet)
   - a PCI DMA fix+improvement
   - reset handler changes
   - more 8xx optimizations
   - some cleanups and fixes.'

  Fixes:
   - selftests/powerpc: Add missing binaries to .gitignores (Michael Ellerman)
   - selftests/powerpc: Fix build break caused by EXPORT_SYMBOL changes (Michael Ellerman)
   - powerpc/pseries: Fix stack corruption in htpe code (Laurent Dufour)
   - powerpc/64s: Fix power4_fixup_nap placement (Nicholas Piggin)
   - powerpc/64: Fix incorrect return value from __copy_tofrom_user (Paul Mackerras)
   - powerpc/mm/hash64: Fix might_have_hea() check (Michael Ellerman)

  Other:
   - MAINTAINERS: Remove myself from PA Semi entries (Olof Johansson)
   - MAINTAINERS: Drop separate pseries entry (Michael Ellerman)
   - MAINTAINERS: Update powerpc website & add selftests (Michael Ellerman):

* tag 'powerpc-4.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (35 commits)
  powerpc/mm/hash64: Fix might_have_hea() check
  powerpc/64: Fix incorrect return value from __copy_tofrom_user
  powerpc/64s: Fix power4_fixup_nap placement
  powerpc/pseries: Fix stack corruption in htpe code
  selftests/powerpc: Fix build break caused by EXPORT_SYMBOL changes
  MAINTAINERS: Update powerpc website & add selftests
  MAINTAINERS: Drop separate pseries entry
  MAINTAINERS: Remove myself from PA Semi entries
  selftests/powerpc: Add missing binaries to .gitignores
  arch/powerpc: Add CONFIG_FSL_DPAA to corenetXX_smp_defconfig
  soc/qman: Add self-test for QMan driver
  soc/bman: Add self-test for BMan driver
  soc/fsl: Introduce DPAA 1.x QMan device driver
  soc/fsl: Introduce DPAA 1.x BMan device driver
  powerpc/8xx: make user addr DTLB miss the short path
  powerpc/8xx: Move additional DTLBMiss handlers out of exception area
  powerpc/8xx: use r3 to scratch CR in ITLBmiss
  soc/fsl/qe: fix gpio save_regs functions
  powerpc/8xx: add dedicated machine check handler
  powerpc/8xx: add system_reset_exception
  ...
2016-10-14 11:07:42 -07:00
Michael Ellerman
08bf75ba85 powerpc/mm/hash64: Fix might_have_hea() check
In commit 2b4e3ad8f5 ("powerpc/mm/hash64: Don't test for machine type
to detect HEA special case") we changed the logic in might_have_hea()
to check FW_FEATURE_SPLPAR rather than machine_is(pseries).

However the check was incorrectly negated, leading to crashes on
machines with HEA adapters, such as:

  mm: Hashing failure ! EA=0xd000080080004040 access=0x800000000000000e current=NetworkManager
      trap=0x300 vsid=0x13d349c ssize=1 base psize=2 psize 2 pte=0xc0003cc033e701ae
  Unable to handle kernel paging request for data at address 0xd000080080004040
  Call Trace:
    .ehea_create_cq+0x148/0x340 [ehea] (unreliable)
    .ehea_up+0x258/0x1200 [ehea]
    .ehea_open+0x44/0x1a0 [ehea]
    ...

Fix it by removing the negation.

Fixes: 2b4e3ad8f5 ("powerpc/mm/hash64: Don't test for machine type to detect HEA special case")
Cc: stable@vger.kernel.org # v4.8+
Reported-by: Denis Kirjanov <kda@linux-powerpc.org>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-12 08:32:28 +11:00
Linus Torvalds
07021b4359 powerpc updates for 4.9
Highlights:
  - Major rework of Book3S 64-bit exception vectors (Nicholas Piggin)
    - Use gas sections for arranging exception vectors et. al.
  - Large set of TM cleanups and selftests (Cyril Bur)
  - Enable transactional memory (TM) lazily for userspace (Cyril Bur)
  - Support for XZ compression in the zImage wrapper (Oliver O'Halloran)
  - Add support for bpf constant blinding (Naveen N. Rao)
  - Beginnings of upstream support for PA Semi Nemo motherboards (Darren Stevens)
 
 Fixes:
  - Ensure .mem(init|exit).text are within _stext/_etext (Michael Ellerman)
  - xmon: Don't use ld on 32-bit (Michael Ellerman)
  - vdso64: Use double word compare on pointers (Anton Blanchard)
  - powerpc/nvram: Fix an incorrect partition merge (Pan Xinhui)
  - powerpc: Fix usage of _PAGE_RO in hugepage (Christophe Leroy)
  - powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K (Aneesh Kumar K.V)
  - Fix memory leak in queue_hotplug_event() error path (Andrew Donnellan)
  - Replay hypervisor maintenance interrupt first (Nicholas Piggin)
 
 Cleanups & features:
  - Sparse fixes/cleanups (Daniel Axtens)
  - Preserve CFAR value on SLB miss caused by access to bogus address (Paul Mackerras)
  - Radix MMU fixups for POWER9 (Aneesh Kumar K.V)
  - Support for setting used_(vsr|vr|spe) in sigreturn path (for CRIU) (Simon Guo)
  - Optimise syscall entry for virtual, relocatable case (Nicholas Piggin)
  - Optimise MSR handling in exception handling (Nicholas Piggin)
  - Support for kexec with Radix MMU (Benjamin Herrenschmidt)
  - powernv EEH fixes (Russell Currey)
  - Suprise PCI hotplug support for powernv (Gavin Shan)
  - Endian/sparse fixes for powernv PCI (Gavin Shan)
  - Defconfig updates (Anton Blanchard)
  - Various performance optimisations (Anton Blanchard)
    - Align hot loops of memset() and backwards_memcpy()
    - During context switch, check before setting mm_cpumask
    - Remove static branch prediction in atomic{, 64}_add_unless
    - Only disable HAVE_EFFICIENT_UNALIGNED_ACCESS on POWER7 little endian
    - Set default CPU type to POWER8 for little endian builds
 
  - KVM: PPC: Book3S HV: Migrate pinned pages out of CMA (Balbir Singh)
  - cxl: Flush PSL cache before resetting the adapter (Frederic Barrat)
  - cxl: replace loop with for_each_child_of_node(), remove unneeded of_node_put() (Andrew Donnellan)
  - Fix HV facility unavailable to use correct handler (Nicholas Piggin)
  - Remove unnecessary syscall trampoline (Nicholas Piggin)
  - fadump: Fix build break when CONFIG_PROC_VMCORE=n (Michael Ellerman)
  - Quieten EEH message when no adapters are found (Anton Blanchard)
  - powernv: Add PHB register dump debugfs handle (Russell Currey)
  - Use kprobe blacklist for exception handlers & asm functions (Nicholas Piggin)
  - Document the syscall ABI (Nicholas Piggin)
  - MAINTAINERS: Update cxl maintainers (Michael Neuling)
  - powerpc: Remove all usages of NO_IRQ (Michael Ellerman)
 
 Minor cleanups:
  - Andrew Donnellan, Christophe Leroy, Colin Ian King, Cyril Bur, Frederic Barrat,
    Pan Xinhui, PrasannaKumar Muralidharan, Rui Teng, Simon Guo.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX9x5ZAAoJEFHr6jzI4aWAWQ0P+gOhdtayMsRY0k0dzPmYaFr0
 Ha5v968RJaNIyGGM9ARJg8h27PGMaSlBp/9zaYdk1G7xfv/DMR0uq8d8l5pjy/Zw
 Jm72WE4PEX/zAcQxry6Y2fDdumO09crTBA/W0hM1UZzqu0bcVUfD+E51ZFYWW7yh
 fyhT2YnlucxIcT34pxsLqwTIiZYG4xgN3+YGo0wohY1D1GHE3UZ7SXIglb49yM6v
 ZeXrL7SOdERR1w88rC+g99P/cWng5HDS0wPLUbxGT5KIpoOSXOs7EbZwFqQBUy5O
 37PB07K5dDyUbrm++l5lUigldF3W1OZQBN5+n8PciulxxwFX84pllTlAxv1p60JR
 piEKZ8pl023IF7zMGatUG9qcNOcnbxdMsAhoEhlcFi9ulM/yLzbmRTKVfDYm+O/J
 UI+YtcbsgdyOXMdGXCqdpeBNuuypgLG/g7gC8bnk3taS0LUUZLcXtRNuE4tcPJJe
 v8FnszaLkjAi83Lmzt3fgZo7DI1RIPwDSw6fY+nBrxCRfEPRVx3f7KhmUXvSeol5
 Ln9xpk4AtyQt1RHhckxXwWSUgvXVg2ltmz7ElqK4sQ9mO/D2ZIs6R6fPY4VlJLc4
 /2yIV4RLIsbHmdv9IbJ8PBp0VTugSNdicZ904QiAHSZQv/i1mgYuXw3tjR6kuy9f
 bKOzNJTwLV1WUsOlUpiq
 =Jnn8
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights:
   - Major rework of Book3S 64-bit exception vectors (Nicholas Piggin)
   - Use gas sections for arranging exception vectors et. al.
   - Large set of TM cleanups and selftests (Cyril Bur)
   - Enable transactional memory (TM) lazily for userspace (Cyril Bur)
   - Support for XZ compression in the zImage wrapper (Oliver
     O'Halloran)
   - Add support for bpf constant blinding (Naveen N. Rao)
   - Beginnings of upstream support for PA Semi Nemo motherboards
     (Darren Stevens)

  Fixes:
   - Ensure .mem(init|exit).text are within _stext/_etext (Michael
     Ellerman)
   - xmon: Don't use ld on 32-bit (Michael Ellerman)
   - vdso64: Use double word compare on pointers (Anton Blanchard)
   - powerpc/nvram: Fix an incorrect partition merge (Pan Xinhui)
   - powerpc: Fix usage of _PAGE_RO in hugepage (Christophe Leroy)
   - powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K
     (Aneesh Kumar K.V)
   - Fix memory leak in queue_hotplug_event() error path (Andrew
     Donnellan)
   - Replay hypervisor maintenance interrupt first (Nicholas Piggin)

  Various performance optimisations (Anton Blanchard):
   - Align hot loops of memset() and backwards_memcpy()
   - During context switch, check before setting mm_cpumask
   - Remove static branch prediction in atomic{, 64}_add_unless
   - Only disable HAVE_EFFICIENT_UNALIGNED_ACCESS on POWER7 little
     endian
   - Set default CPU type to POWER8 for little endian builds

  Cleanups & features:
   - Sparse fixes/cleanups (Daniel Axtens)
   - Preserve CFAR value on SLB miss caused by access to bogus address
     (Paul Mackerras)
   - Radix MMU fixups for POWER9 (Aneesh Kumar K.V)
   - Support for setting used_(vsr|vr|spe) in sigreturn path (for CRIU)
     (Simon Guo)
   - Optimise syscall entry for virtual, relocatable case (Nicholas
     Piggin)
   - Optimise MSR handling in exception handling (Nicholas Piggin)
   - Support for kexec with Radix MMU (Benjamin Herrenschmidt)
   - powernv EEH fixes (Russell Currey)
   - Suprise PCI hotplug support for powernv (Gavin Shan)
   - Endian/sparse fixes for powernv PCI (Gavin Shan)
   - Defconfig updates (Anton Blanchard)
   - KVM: PPC: Book3S HV: Migrate pinned pages out of CMA (Balbir Singh)
   - cxl: Flush PSL cache before resetting the adapter (Frederic Barrat)
   - cxl: replace loop with for_each_child_of_node(), remove unneeded
     of_node_put() (Andrew Donnellan)
   - Fix HV facility unavailable to use correct handler (Nicholas
     Piggin)
   - Remove unnecessary syscall trampoline (Nicholas Piggin)
   - fadump: Fix build break when CONFIG_PROC_VMCORE=n (Michael
     Ellerman)
   - Quieten EEH message when no adapters are found (Anton Blanchard)
   - powernv: Add PHB register dump debugfs handle (Russell Currey)
   - Use kprobe blacklist for exception handlers & asm functions
     (Nicholas Piggin)
   - Document the syscall ABI (Nicholas Piggin)
   - MAINTAINERS: Update cxl maintainers (Michael Neuling)
   - powerpc: Remove all usages of NO_IRQ (Michael Ellerman)

  Minor cleanups:
   - Andrew Donnellan, Christophe Leroy, Colin Ian King, Cyril Bur,
     Frederic Barrat, Pan Xinhui, PrasannaKumar Muralidharan, Rui Teng,
     Simon Guo"

* tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits)
  powerpc/bpf: Add support for bpf constant blinding
  powerpc/bpf: Implement support for tail calls
  powerpc/bpf: Introduce accessors for using the tmp local stack space
  powerpc/fadump: Fix build break when CONFIG_PROC_VMCORE=n
  powerpc: tm: Enable transactional memory (TM) lazily for userspace
  powerpc/tm: Add TM Unavailable Exception
  powerpc: Remove do_load_up_transact_{fpu,altivec}
  powerpc: tm: Rename transct_(*) to ck(\1)_state
  powerpc: tm: Always use fp_state and vr_state to store live registers
  selftests/powerpc: Add checks for transactional VSXs in signal contexts
  selftests/powerpc: Add checks for transactional VMXs in signal contexts
  selftests/powerpc: Add checks for transactional FPUs in signal contexts
  selftests/powerpc: Add checks for transactional GPRs in signal contexts
  selftests/powerpc: Check that signals always get delivered
  selftests/powerpc: Add TM tcheck helpers in C
  selftests/powerpc: Allow tests to extend their kill timeout
  selftests/powerpc: Introduce GPR asm helper header file
  selftests/powerpc: Move VMX stack frame macros to header file
  selftests/powerpc: Rework FPU stack placement macros and move to header file
  selftests/powerpc: Check for VSX preservation across userspace preemption
  ...
2016-10-07 20:19:31 -07:00
Linus Torvalds
6218590bcb KVM updates for v4.9-rc1
All architectures:
   Move `make kvmconfig` stubs from x86;  use 64 bits for debugfs stats.
 
 ARM:
   Important fixes for not using an in-kernel irqchip; handle SError
   exceptions and present them to guests if appropriate; proxying of GICV
   access at EL2 if guest mappings are unsafe; GICv3 on AArch32 on ARMv8;
   preparations for GICv3 save/restore, including ABI docs; cleanups and
   a bit of optimizations.
 
 MIPS:
   A couple of fixes in preparation for supporting MIPS EVA host kernels;
   MIPS SMP host & TLB invalidation fixes.
 
 PPC:
   Fix the bug which caused guests to falsely report lockups; other minor
   fixes; a small optimization.
 
 s390:
   Lazy enablement of runtime instrumentation; up to 255 CPUs for nested
   guests; rework of machine check deliver; cleanups and fixes.
 
 x86:
   IOMMU part of AMD's AVIC for vmexit-less interrupt delivery; Hyper-V
   TSC page; per-vcpu tsc_offset in debugfs; accelerated INS/OUTS in
   nVMX; cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJX9iDrAAoJEED/6hsPKofoOPoIAIUlgojkb9l2l1XVDgsXdgQL
 sRVhYSVv7/c8sk9vFImrD5ElOPZd+CEAIqFOu45+NM3cNi7gxip9yftUVs7wI5aC
 eDZRWm1E4trDZLe54ZM9ThcqZzZZiELVGMfR1+ZndUycybwyWzafpXYsYyaXp3BW
 hyHM3qVkoWO3dxBWFwHIoO/AUJrWYkRHEByKyvlC6KPxSdBPSa5c1AQwMCoE0Mo4
 K/xUj4gBn9eMelNhg4Oqu/uh49/q+dtdoP2C+sVM8bSdquD+PmIeOhPFIcuGbGFI
 B+oRpUhIuntN39gz8wInJ4/GRSeTuR2faNPxMn4E1i1u4LiuJvipcsOjPfe0a18=
 =fZRB
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "All architectures:
   - move `make kvmconfig` stubs from x86
   - use 64 bits for debugfs stats

  ARM:
   - Important fixes for not using an in-kernel irqchip
   - handle SError exceptions and present them to guests if appropriate
   - proxying of GICV access at EL2 if guest mappings are unsafe
   - GICv3 on AArch32 on ARMv8
   - preparations for GICv3 save/restore, including ABI docs
   - cleanups and a bit of optimizations

  MIPS:
   - A couple of fixes in preparation for supporting MIPS EVA host
     kernels
   - MIPS SMP host & TLB invalidation fixes

  PPC:
   - Fix the bug which caused guests to falsely report lockups
   - other minor fixes
   - a small optimization

  s390:
   - Lazy enablement of runtime instrumentation
   - up to 255 CPUs for nested guests
   - rework of machine check deliver
   - cleanups and fixes

  x86:
   - IOMMU part of AMD's AVIC for vmexit-less interrupt delivery
   - Hyper-V TSC page
   - per-vcpu tsc_offset in debugfs
   - accelerated INS/OUTS in nVMX
   - cleanups and fixes"

* tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (140 commits)
  KVM: MIPS: Drop dubious EntryHi optimisation
  KVM: MIPS: Invalidate TLB by regenerating ASIDs
  KVM: MIPS: Split kernel/user ASID regeneration
  KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
  KVM: arm/arm64: vgic: Don't flush/sync without a working vgic
  KVM: arm64: Require in-kernel irqchip for PMU support
  KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
  KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL
  KVM: PPC: Book3S: Remove duplicate setting of the B field in tlbie
  KVM: PPC: BookE: Fix a sanity check
  KVM: PPC: Book3S HV: Take out virtual core piggybacking code
  KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread
  ARM: gic-v3: Work around definition of gic_write_bpr1
  KVM: nVMX: Fix the NMI IDT-vectoring handling
  KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
  KVM: nVMX: Fix reload apic access page warning
  kvmconfig: add virtio-gpu to config fragment
  config: move x86 kvm_guest.config to a common location
  arm64: KVM: Remove duplicating init code for setting VMID
  ARM: KVM: Support vgic-v3
  ...
2016-10-06 10:49:01 -07:00
Linus Torvalds
597f03f9d1 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
 "Yet another batch of cpu hotplug core updates and conversions:

   - Provide core infrastructure for multi instance drivers so the
     drivers do not have to keep custom lists.

   - Convert custom lists to the new infrastructure. The block-mq custom
     list conversion comes through the block tree and makes the diffstat
     tip over to more lines removed than added.

   - Handle unbalanced hotplug enable/disable calls more gracefully.

   - Remove the obsolete CPU_STARTING/DYING notifier support.

   - Convert another batch of notifier users.

   The relayfs changes which conflicted with the conversion have been
   shipped to me by Andrew.

   The remaining lot is targeted for 4.10 so that we finally can remove
   the rest of the notifiers"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  cpufreq: Fix up conversion to hotplug state machine
  blk/mq: Reserve hotplug states for block multiqueue
  x86/apic/uv: Convert to hotplug state machine
  s390/mm/pfault: Convert to hotplug state machine
  mips/loongson/smp: Convert to hotplug state machine
  mips/octeon/smp: Convert to hotplug state machine
  fault-injection/cpu: Convert to hotplug state machine
  padata: Convert to hotplug state machine
  cpufreq: Convert to hotplug state machine
  ACPI/processor: Convert to hotplug state machine
  virtio scsi: Convert to hotplug state machine
  oprofile/timer: Convert to hotplug state machine
  block/softirq: Convert to hotplug state machine
  lib/irq_poll: Convert to hotplug state machine
  x86/microcode: Convert to hotplug state machine
  sh/SH-X3 SMP: Convert to hotplug state machine
  ia64/mca: Convert to hotplug state machine
  ARM/OMAP/wakeupgen: Convert to hotplug state machine
  ARM/shmobile: Convert to hotplug state machine
  arm64/FP/SIMD: Convert to hotplug state machine
  ...
2016-10-03 19:43:08 -07:00
Balbir Singh
2e5bbb5461 KVM: PPC: Book3S HV: Migrate pinned pages out of CMA
When PCI Device pass-through is enabled via VFIO, KVM-PPC will
pin pages using get_user_pages_fast(). One of the downsides of
the pinning is that the page could be in CMA region. The CMA
region is used for other allocations like the hash page table.
Ideally we want the pinned pages to be from non CMA region.

This patch (currently only for KVM PPC with VFIO) forcefully
migrates the pages out (huge pages are omitted for the moment).
There are more efficient ways of doing this, but that might
be elaborate and might impact a larger audience beyond just
the kvm ppc implementation.

The magic is in new_iommu_non_cma_page() which allocates the
new page from a non CMA region.

I've tested the patches lightly at my end. The full solution
requires migration of THP pages in the CMA region. That work
will be done incrementally on top of this.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[mpe: Merged via powerpc tree as that's where the changes are]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-29 15:14:44 +10:00
Rui Teng
f1a55ce054 powerpc: Clean up tm_abort duplication in hash_utils_64.c
The same logic appears twice and should probably be pulled out into a
function.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Rui Teng <rui.teng@linux.vnet.ibm.com>
[mpe: Rename to tm_flush_hash_page() and move comment into the function]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-23 07:54:23 +10:00
Christophe Leroy
6b8cb66a6a powerpc: Fix usage of _PAGE_RO in hugepage
On some CPUs like the 8xx, _PAGE_RW hence _PAGE_WRITE is defined
as 0 and _PAGE_RO has to be set when a page is not writable

_PAGE_RO is defined by default in pte-common.h, however BOOK3S/64
doesn't include that file so _PAGE_RO has to be defined explicitly
in book3s/64/pgtable.h

Fixes: a7b9f671f2 ("powerpc32: adds handling of _PAGE_RO")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-23 07:54:22 +10:00
Aneesh Kumar K.V
be34d30059 powerpc/mm: Add radix flush all with IS=3
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-23 07:54:18 +10:00
Benjamin Herrenschmidt
fe036a0605 powerpc/64/kexec: Fix MMU cleanup on radix
Just using the hash ops won't work anymore since radix will have
NULL in there. Instead create an mmu_cleanup_all() function which
will do the right thing based on the MMU mode.

For Radix, for now I clear UPRT and the PTCR, effectively switching
back to Radix with no partition table setup.

Currently set it to NULL on BookE thought it might be a good idea
to wipe the TLB there (Scott ?)

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-23 07:54:17 +10:00
Nicholas Piggin
03465f899b powerpc: Use kprobe blacklist for exception handlers
Currently we mark the C implementations of some exception handlers as
__kprobes. This has the effect of putting them in the ".kprobes.text"
section, which separates them from the rest of the text.

Instead we can use the blacklist macros to add the symbols to a
blacklist which kprobes will check. This allows the linker to move
exception handler functions close to callers and avoids trampolines in
larger kernels.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Reword change log a bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-19 10:53:54 +10:00
Colin Ian King
3daf3c2069 powerpc/32: Add missing \n and switch to pr_warn()
The message is missing a \n, add it. Switch to pr_warn(), it's shorter
and less ugly.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13 17:37:11 +10:00
Aneesh Kumar K.V
ad410674f5 powerpc/mm: Update the HID bit when switching from radix to hash
Power9 DD1 requires to update the hid0 register when switching from
hash to radix.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13 17:37:10 +10:00
Aneesh Kumar K.V
c6d1a767b9 powerpc/mm/radix: Use different pte update sequence for different POWER9 revs
POWER9 DD1 requires pte to be marked invalid (V=0) before updating
it with the new value. This makes this distinction for the different
revisions.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13 17:37:10 +10:00
Michael Ellerman
68201fbbb0 powerpc/Makefile: Drop CONFIG_WORD_SIZE for BITS
Commit 2578bfae84 ("[POWERPC] Create and use CONFIG_WORD_SIZE") added
CONFIG_WORD_SIZE, and suggests that other arches were going to do
likewise.

But that never happened, powerpc is the only architecture which uses it.

So switch to using a simple make variable, BITS, like x86, sh, sparc and
tile. It is also easier to spell and simpler, avoiding any confusion
about whether it's defined due to ordering of make vs kconfig.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13 17:37:06 +10:00
Paul Mackerras
f0f558b131 powerpc/mm: Preserve CFAR value on SLB miss caused by access to bogus address
Currently, if userspace or the kernel accesses a completely bogus address,
for example with any of bits 46-59 set, we first take an SLB miss interrupt,
install a corresponding SLB entry with VSID 0, retry the instruction, then
take a DSI/ISI interrupt because there is no HPT entry mapping the address.
However, by the time of the second interrupt, the Come-From Address Register
(CFAR) has been overwritten by the rfid instruction at the end of the SLB
miss interrupt handler.  Since bogus accesses can often be caused by a
function return after the stack has been overwritten, the CFAR value would
be very useful as it could indicate which function it was whose return had
led to the bogus address.

This patch adds code to create a full exception frame in the SLB miss handler
in the case of a bogus address, rather than inserting an SLB entry with a
zero VSID field.  Then we call a new slb_miss_bad_addr() function in C code,
which delivers a signal for a user access or creates an oops for a kernel
access.  In the latter case the oops message will show the CFAR value at the
time of the access.

In the case of the radix MMU, a segment miss interrupt indicates an access
outside the ranges mapped by the page tables.  Previously this was handled
by the code for an unrecoverable SLB miss (one with MSR[RI] = 0), which is
not really correct.  With this patch, we now handle these interrupts with
slb_miss_bad_addr(), which is much more consistent.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13 17:37:03 +10:00
Paul Mackerras
0eeede0c63 powerpc/mm: Speed up computation of base and actual page size for a HPTE
This replaces a 2-D search through an array with a simple 8-bit table
lookup for determining the actual and/or base page size for a HPT entry.

The encoding in the second doubleword of the HPTE is designed to encode
the actual and base page sizes without using any more bits than would be
needed for a 4k page number, by using between 1 and 8 low-order bits of
the RPN (real page number) field to encode the page sizes.  A single
"large page" bit in the first doubleword indicates that these low-order
bits are to be interpreted like this.

We can determine the page sizes by using the low-order 8 bits of the RPN
to look up a 256-entry table.  For actual page sizes less than 1MB, some
of the upper bits of these 8 bits are going to be real address bits, but
we can cope with that by replicating the entries for those smaller page
sizes.

While we're at it, let's move the hpte_page_size() and hpte_base_page_size()
functions from a KVM-specific header to a header for 64-bit HPT systems,
since this computation doesn't have anything specifically to do with KVM.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-09 16:14:48 +10:00
Paul Mackerras
f077aaf075 powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET
In commit c60ac5693c ("powerpc: Update kernel VSID range", 2013-03-13)
we lost a check on the region number (the top four bits of the effective
address) for addresses below PAGE_OFFSET.  That commit replaced a check
that the top 18 bits were all zero with a check that bits 46 - 59 were
zero (performed for all addresses, not just user addresses).

This means that userspace can access an address like 0x1000_0xxx_xxxx_xxxx
and we will insert a valid SLB entry for it.  The VSID used will be the
same as if the top 4 bits were 0, but the page size will be some random
value obtained by indexing beyond the end of the mm_ctx_high_slices_psize
array in the paca.  If that page size is the same as would be used for
region 0, then userspace just has an alias of the region 0 space.  If the
page size is different, then no HPTE will be found for the access, and
the process will get a SIGSEGV (since hash_page_mm() will refuse to create
a HPTE for the bogus address).

The access beyond the end of the mm_ctx_high_slices_psize can be at most
5.5MB past the array, and so will be in RAM somewhere.  Since the access
is a load performed in real mode, it won't fault or crash the kernel.
At most this bug could perhaps leak a little bit of information about
blocks of 32 bytes of memory located at offsets of i * 512kB past the
paca->mm_ctx_high_slices_psize array, for 1 <= i <= 11.

Fixes: c60ac5693c ("powerpc: Update kernel VSID range")
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-08 13:15:33 +10:00
Sebastian Andrzej Siewior
da3ed6519b powerpc/mmu nohash: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: rt@linutronix.de
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20160818125731.27256-17-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:27 +02:00
Paul Gortmaker
8a39b05f08 powerpc: migrate exception table users off module.h and onto extable.h
These files were only including module.h for exception table
related functions.  We've now separated that content out into its
own file "extable.h" so now move over to that and avoid all the
extra header content in module.h that we don't really need to compile
these files.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2016-08-22 11:09:33 +10:00
Al Viro
9445aa1a30 ppc: move exports to definitions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-08-07 23:50:09 -04:00
Linus Torvalds
2cfd716d27 powerpc updates for 4.8 #2
Fixes:
  - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
  - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
  - Move register_process_table() out of ppc_md from Michael Ellerman
 
 Use jump_label for [cpu|mmu]_has_feature() from Aneesh Kumar K.V, Kevin Hao and Michael Ellerman:
  - Add mmu_early_init_devtree() from Michael Ellerman
  - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
  - Do hash device tree scanning earlier from Michael Ellerman
  - Do radix device tree scanning earlier from Michael Ellerman
  - Do feature patching before MMU init from Michael Ellerman
  - Check features don't change after patching from Michael Ellerman
  - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
  - Convert mmu_has_feature() to returning bool from Michael Ellerman
  - Convert cpu_has_feature() to returning bool from Michael Ellerman
  - Define radix_enabled() in one place & use static inline from Michael Ellerman
  - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
  - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
  - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
  - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
  - Remove mfvtb() from Kevin Hao
  - Move cpu_has_feature() to a separate file from Kevin Hao
  - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
  - Add option to use jump label for cpu_has_feature() from Kevin Hao
  - Add option to use jump label for mmu_has_feature() from Kevin Hao
  - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
  - Annotate jump label assembly from Michael Ellerman
 
 TLB flush enhancements from Aneesh Kumar K.V:
  - radix: Implement tlb mmu gather flush efficiently
  - Add helper for finding SLBE LLP encoding
  - Use hugetlb flush functions
  - Drop multiple definition of mm_is_core_local
  - radix: Add tlb flush of THP ptes
  - radix: Rename function and drop unused arg
  - radix/hugetlb: Add helper for finding page size
  - hugetlb: Add flush_hugetlb_tlb_range
  - remove flush_tlb_page_nohash
 
 Add new ptrace regsets from Anshuman Khandual and Simon Guo:
  - elf: Add powerpc specific core note sections
  - Add the function flush_tmregs_to_thread
  - Enable in transaction NT_PRFPREG ptrace requests
  - Enable in transaction NT_PPC_VMX ptrace requests
  - Enable in transaction NT_PPC_VSX ptrace requests
  - Adapt gpr32_get, gpr32_set functions for transaction
  - Enable support for NT_PPC_CGPR
  - Enable support for NT_PPC_CFPR
  - Enable support for NT_PPC_CVMX
  - Enable support for NT_PPC_CVSX
  - Enable support for TM SPR state
  - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
  - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
  - Enable support for EBB registers
  - Enable support for Performance Monitor registers
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXpGaLAAoJEFHr6jzI4aWA9aYP/1AqmRPJ9D0XVUJWT+FVABUK
 LESESoVFF4Hug1j1F8Synhg5o4SzD2t45iGKbclYaFthOIyovMg7Wr1KSu4hQ0go
 rPuQfpXDNQ8jKdDX8hbPXKUxrNRBNfqJGFo5E7mO6wN9AJ9d1LVwQ+jKAva29Tqs
 LaAlMbQNbeObPNzOl73B73iew3aozr+mXjBqv82lqvgYknBD2CLf24xGG3eNIbq5
 ZZk4LPC8pdkaxnajnzRFzqwiyPWzao0yfpVRKh52TKHBQF/prR/KACb6zUuja/61
 krOfegUKob14OYrehjs6X8XNRLnILRI0u1H5bmj7eVEiY/usyNzE93SMHZM3Wdau
 sQF/Au4OLNXj0ZQdNBtzRsZRyp1d560Gsj+lQGBoPd4hfIWkFYHvxzxsUSdqv4uA
 MWDMwN0Vvfk0cpprsabsWNevkaotYYBU00px5hF/e5ZUc9/x/xYUVMgPEDr0QZLr
 cHJo9/Pjk4u/0g4lj+2y1LLl/0tNEZZg69O6bvffPAPVSS4/P4y/bKKYd4I0zL99
 Ykp91mSmkl70F3edgOSFqyda2gN2l2Ekb/i081YGXheFy1rbD29Vxv82BOVog4KY
 ibvOqp38WDzCVk5OXuCRvBl0VudLKGJYdppU1nXg4KgrTZXHeCAC0E+NzUsgOF4k
 OMvQ+5drVxrno+Hw8FVJ
 =0Q8E
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull more powerpc updates from Michael Ellerman:
 "These were delayed for various reasons, so I let them sit in next a
  bit longer, rather than including them in my first pull request.

  Fixes:
   - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
   - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
   - Move register_process_table() out of ppc_md from Michael Ellerman

  Use jump_label use for [cpu|mmu]_has_feature():
   - Add mmu_early_init_devtree() from Michael Ellerman
   - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
   - Do hash device tree scanning earlier from Michael Ellerman
   - Do radix device tree scanning earlier from Michael Ellerman
   - Do feature patching before MMU init from Michael Ellerman
   - Check features don't change after patching from Michael Ellerman
   - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
   - Convert mmu_has_feature() to returning bool from Michael Ellerman
   - Convert cpu_has_feature() to returning bool from Michael Ellerman
   - Define radix_enabled() in one place & use static inline from Michael Ellerman
   - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
   - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
   - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
   - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
   - Remove mfvtb() from Kevin Hao
   - Move cpu_has_feature() to a separate file from Kevin Hao
   - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
   - Add option to use jump label for cpu_has_feature() from Kevin Hao
   - Add option to use jump label for mmu_has_feature() from Kevin Hao
   - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
   - Annotate jump label assembly from Michael Ellerman

  TLB flush enhancements from Aneesh Kumar K.V:
   - radix: Implement tlb mmu gather flush efficiently
   - Add helper for finding SLBE LLP encoding
   - Use hugetlb flush functions
   - Drop multiple definition of mm_is_core_local
   - radix: Add tlb flush of THP ptes
   - radix: Rename function and drop unused arg
   - radix/hugetlb: Add helper for finding page size
   - hugetlb: Add flush_hugetlb_tlb_range
   - remove flush_tlb_page_nohash

  Add new ptrace regsets from Anshuman Khandual and Simon Guo:
   - elf: Add powerpc specific core note sections
   - Add the function flush_tmregs_to_thread
   - Enable in transaction NT_PRFPREG ptrace requests
   - Enable in transaction NT_PPC_VMX ptrace requests
   - Enable in transaction NT_PPC_VSX ptrace requests
   - Adapt gpr32_get, gpr32_set functions for transaction
   - Enable support for NT_PPC_CGPR
   - Enable support for NT_PPC_CFPR
   - Enable support for NT_PPC_CVMX
   - Enable support for NT_PPC_CVSX
   - Enable support for TM SPR state
   - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
   - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
   - Enable support for EBB registers
   - Enable support for Performance Monitor registers"

* tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
  powerpc/mm: Move register_process_table() out of ppc_md
  powerpc/perf: Fix incorrect event codes in power9-event-list
  powerpc/32: Fix early access to cpu_spec relocation
  powerpc/ptrace: Enable support for Performance Monitor registers
  powerpc/ptrace: Enable support for EBB registers
  powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
  powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
  powerpc/ptrace: Enable support for TM SPR state
  powerpc/ptrace: Enable support for NT_PPC_CVSX
  powerpc/ptrace: Enable support for NT_PPC_CVMX
  powerpc/ptrace: Enable support for NT_PPC_CFPR
  powerpc/ptrace: Enable support for NT_PPC_CGPR
  powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
  powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
  powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
  powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
  powerpc/process: Add the function flush_tmregs_to_thread
  elf: Add powerpc specific core note sections
  powerpc/mm: remove flush_tlb_page_nohash
  powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
  ...
2016-08-05 09:00:54 -04:00
Michael Ellerman
eea8148c69 powerpc/mm: Move register_process_table() out of ppc_md
We want to initialise register_process_table() before ppc_md is setup,
so that it can be called as part of MMU init (at least on Radix ATM).

That no longer works because probe_machine() requires that ppc_md be
empty before it's called, and we now do probe_machine() much later.

So make register_process_table a global for now. It will probably move
into a mmu_radix_ops struct at some point in the future.

This was broken by me when applying commit 7025776ed1 "powerpc/mm:
Move hash table ops to a separate structure" due to conflicts with other
patches.

Fixes: 7025776ed1 ("powerpc/mm: Move hash table ops to a separate structure")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-04 20:22:34 +10:00
Fabian Frederick
bd721ea73e treewide: replace obsolete _refok by __ref
There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485fb ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok     __ref
#define __initdata_refok __refdata
#define __exit_refok     __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Aneesh Kumar K.V
703b41ad1a powerpc/mm: remove flush_tlb_page_nohash
This should be same as flush_tlb_page except for hash32. For hash32
I guess the existing code is wrong, because we don't seem to be
flushing tlb for Hash != 0 case at all. Fix this by switching to
calling flush_tlb_page() which does the right thing by flushing
tlb for both hash and nohash case with hash32

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:13 +10:00
Aneesh Kumar K.V
5491ae7b6f powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
Some archs like ppc64 need to do special things when flushing tlb for
hugepage. Add a new helper to flush hugetlb tlb range. This helps us to
avoid flushing the entire tlb mapping for the pid.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:13 +10:00
Aneesh Kumar K.V
fbfa26d854 powerpc/mm/radix/hugetlb: Add helper for finding page size from hstate
Use the helper instead of open coding the same at multiple place

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:12 +10:00
Aneesh Kumar K.V
f22dfc9158 powerpc/mm/radix: Rename function and drop unused arg
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:11 +10:00
Aneesh Kumar K.V
d8e91e93e9 powerpc/mm/radix: Add tlb flush of THP ptes
Instead of flushing the entire mm, implement a flush_pmd_tlb_range

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:10 +10:00
Aneesh Kumar K.V
9d4dab1158 powerpc/mm: Drop multiple definition of mm_is_core_local
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:10 +10:00
Aneesh Kumar K.V
138ee7ee01 powerpc/mm/hash: Add helper for finding SLBE LLP encoding
Replace opencoding of the same at multiple places with the helper.
No functional change with this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:08 +10:00
Aneesh Kumar K.V
8cb8140c4c powerpc/mm/radix: Implement tlb mmu gather flush efficiently
Now that we track page size in mmu_gather, we can use address based
tlbie format when doing a tlb_flush(). We don't do this if we are
invalidating the full address space.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:08 +10:00
Aneesh Kumar K.V
b8f1b4f860 powerpc/mm: Convert early cpu/mmu feature check to use the new helpers
This switches early feature checks to use the non static key variant of
the function. In later patches we will be switching cpu_has_feature()
and mmu_has_feature() to use static keys and we can use them only after
static key/jump label is initialized. Any check for feature before jump
label init should be done using this new helper.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:01 +10:00
Aneesh Kumar K.V
5a25b6f527 powerpc/mm: Make MMU_FTR_RADIX a MMU family feature
MMU feature bits are defined such that we use the lower half to
present MMU family features. Remove the strict split of half and
also move Radix to a mmu family feature. Radix introduce a new MMU
model and strictly speaking it is a new MMU family. This also free
up bits which can be used for individual features later.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:14:57 +10:00
Michael Ellerman
2537b09c93 powerpc/mm: Do radix device tree scanning earlier
Like we just did for hash, split the device tree scanning parts out and
call them from mmu_early_init_devtree().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:14:55 +10:00
Michael Ellerman
bacf9cf883 powerpc/mm: Do hash device tree scanning earlier
Currently MMU initialisation (early_init_mmu()) consists of a mixture of
scanning the device tree, setting MMU feature bits, and then also doing
actual initialisation of MMU data structures.

We'd like to decouple the setting of the MMU features from the actual
setup. So split out the device tree scanning, and associated code, and
call it from mmu_init_early_devtree().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:14:54 +10:00
Michael Ellerman
c610ec60ed powerpc/mm: Move disable_radix handling into mmu_early_init_devtree()
Move the handling of the disable_radix command line argument into the
newly created mmu_early_init_devtree().

It's an MMU option so it's preferable to have it in an mm related file,
and it also means platforms that don't support radix don't have to carry
the code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:14:53 +10:00
Michael Ellerman
1a01dc87e0 powerpc/mm: Add mmu_early_init_devtree()
Empty for now, but we'll add to it in the next patch.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:14:53 +10:00
Linus Torvalds
bad60e6f25 powerpc updates for 4.8 # 1
Highlights:
  - PowerNV PCI hotplug support.
  - Lots more Power9 support.
  - eBPF JIT support on ppc64le.
  - Lots of cxl updates.
  - Boot code consolidation.
 
 Bug fixes:
  - Fix spin_unlock_wait() from Boqun Feng
  - Fix stack pointer corruption in __tm_recheckpoint() from Michael Neuling
  - Fix multiple bugs in memory_hotplug_max() from Bharata B Rao
  - mm: Ensure "special" zones are empty from Oliver O'Halloran
  - ftrace: Separate the heuristics for checking call sites from Michael Ellerman
  - modules: Never restore r2 for a mprofile-kernel style mcount() call from Michael Ellerman
  - Fix endianness when reading TCEs from Alexey Kardashevskiy
  - start rtasd before PCI probing from Greg Kurz
  - PCI: rpaphp: Fix slot registration for multiple slots under a PHB from Tyrel Datwyler
  - powerpc/mm: Add memory barrier in __hugepte_alloc() from Sukadev Bhattiprolu
 
 Cleanups & fixes:
  - Drop support for MPIC in pseries from Rashmica Gupta
  - Define and use PPC64_ELF_ABI_v2/v1 from Michael Ellerman
  - Remove unused symbols in asm-offsets.c from Rashmica Gupta
  - Fix SRIOV not building without EEH enabled from Russell Currey
  - Remove kretprobe_trampoline_holder. from Thiago Jung Bauermann
  - Reduce log level of PCI I/O space warning from Benjamin Herrenschmidt
  - Add array bounds checking to crash_shutdown_handlers from Suraj Jitindar Singh
  - Avoid -maltivec when using clang integrated assembler from Anton Blanchard
  - Fix array overrun in ppc_rtas() syscall from Andrew Donnellan
  - Fix error return value in cmm_mem_going_offline() from Rasmus Villemoes
  - export cpu_to_core_id() from Mauricio Faria de Oliveira
  - Remove old symbols from defconfigs from Andrew Donnellan
  - Update obsolete comments in setup_32.c about entry conditions from Benjamin Herrenschmidt
  - Add comment explaining the purpose of setup_kdump_trampoline() from Benjamin Herrenschmidt
  - Merge the RELOCATABLE config entries for ppc32 and ppc64 from Kevin Hao
  - Remove RELOCATABLE_PPC32 from Kevin Hao
  - Fix .long's in tlb-radix.c to more meaningful from Balbir Singh
 
 Minor cleanups & fixes:
  - Andrew Donnellan, Anna-Maria Gleixner, Anton Blanchard, Benjamin
    Herrenschmidt, Bharata B Rao, Christophe Leroy, Colin Ian King, Geliang
    Tang, Greg Kurz, Madhavan Srinivasan, Michael Ellerman, Michael Ellerman,
    Stephen Rothwell, Stewart Smith.
 
 Freescale updates from Scott:
  - "Highlights include more 8xx optimizations, device tree updates,
    and MVME7100 support."
 
 PowerNV PCI hotplug from Gavin Shan:
  - PCI: Add pcibios_setup_bridge()
  - Override pcibios_setup_bridge()
  - Remove PCI_RESET_DELAY_US
  - Move pnv_pci_ioda_setup_opal_tce_kill() around
  - Increase PE# capacity
  - Allocate PE# in reverse order
  - Create PEs in pcibios_setup_bridge()
  - Setup PE for root bus
  - Extend PCI bridge resources
  - Make pnv_ioda_deconfigure_pe() visible
  - Dynamically release PE
  - Update bridge windows on PCI plug
  - Delay populating pdn
  - Support PCI slot ID
  - Use PCI slot reset infrastructure
  - Introduce pnv_pci_get_slot_id()
  - Functions to get/set PCI slot state
  - PCI/hotplug: PowerPC PowerNV PCI hotplug driver
  - Print correct PHB type names
 
 Power9 idle support from Shreyas B. Prabhu:
  - set power_save func after the idle states are initialized
  - Use PNV_THREAD_WINKLE macro while requesting for winkle
  - make hypervisor state restore a function
  - Rename idle_power7.S to idle_book3s.S
  - Rename reusable idle functions to hardware agnostic names
  - Make pnv_powersave_common more generic
  - abstraction for saving SPRs before entering deep idle states
  - Add platform support for stop instruction
  - cpuidle/powernv: Use CPUIDLE_STATE_MAX instead of MAX_POWERNV_IDLE_STATES
  - cpuidle/powernv: cleanup cpuidle-powernv.c
  - cpuidle/powernv: Add support for POWER ISA v3 idle states
  - Use deepest stop state when cpu is offlined
 
 Power9 PMU from Madhavan Srinivasan:
  - factor out power8 pmu macros and defines
  - factor out power8 pmu functions
  - factor out power8 __init_pmu code
  - Add power9 event list macros for generic and cache events
  - Power9 PMU support
  - Export Power9 generic and cache events to sysfs
 
 Power9 preliminary interrupt & PCI support from Benjamin Herrenschmidt:
  - Add XICS emulation APIs
  - Move a few exception common handlers to make room
  - Add support for HV virtualization interrupts
  - Add mechanism to force a replay of interrupts
  - Add ICP OPAL backend
  - Discover IODA3 PHBs
  - pci: Remove obsolete SW invalidate
  - opal: Add real mode call wrappers
  - Rename TCE invalidation calls
  - Remove SWINV constants and obsolete TCE code
  - Rework accessing the TCE invalidate register
  - Fallback to OPAL for TCE invalidations
  - Use the device-tree to get available range of M64's
  - Check status of a PHB before using it
  - pci: Don't try to allocate resources that will be reassigned
 
 Other Power9:
  - Send SIGBUS on unaligned copy and paste from Chris Smart
  - Large Decrementer support from Oliver O'Halloran
  - Load Monitor Register Support from Jack Miller
 
 Performance improvements from Anton Blanchard:
  - Avoid load hit store in __giveup_fpu() and __giveup_altivec()
  - Avoid load hit store in setup_sigcontext()
  - Remove assembly versions of strcpy, strcat, strlen and strcmp
  - Align hot loops of some string functions
 
 eBPF JIT from Naveen N. Rao:
  - Fix/enhance 32-bit Load Immediate implementation
  - Optimize 64-bit Immediate loads
  - Introduce rotate immediate instructions
  - A few cleanups
  - Isolate classic BPF JIT specifics into a separate header
  - Implement JIT compiler for extended BPF
 
 Operator Panel driver from Suraj Jitindar Singh:
  - devicetree/bindings: Add binding for operator panel on FSP machines
  - Add inline function to get rc from an ASYNC_COMP opal_msg
  - Add driver for operator panel on FSP machines
 
 Sparse fixes from Daniel Axtens:
  - make some things static
  - Introduce asm-prototypes.h
  - Include headers containing prototypes
  - Use #ifdef __BIG_ENDIAN__ #else for REG_BYTE
  - kvm: Clarify __user annotations
  - Pass endianness to sparse
  - Make ppc_md.{halt, restart} __noreturn
 
 MM fixes & cleanups from Aneesh Kumar K.V:
  - radix: Update LPCR HR bit as per ISA
  - use _raw variant of page table accessors
  - Compile out radix related functions if RADIX_MMU is disabled
  - Clear top 16 bits of va only on older cpus
  - Print formation regarding the the MMU mode
  - hash: Update SDR1 size encoding as documented in ISA 3.0
  - radix: Update PID switch sequence
  - radix: Update machine call back to support new HCALL.
  - radix: Add LPID based tlb flush helpers
  - radix: Add a kernel command line to disable radix
  - Cleanup LPCR defines
 
 Boot code consolidation from Benjamin Herrenschmidt:
  - Move epapr_paravirt_early_init() to early_init_devtree()
  - cell: Don't use flat device-tree after boot
  - ge_imp3a: Don't use the flat device-tree after boot
  - mpc85xx_ds: Don't use the flat device-tree after boot
  - mpc85xx_rdb: Don't use the flat device-tree after boot
  - Don't test for machine type in rtas_initialize()
  - Don't test for machine type in smp_setup_cpu_maps()
  - dt: Add of_device_compatible_match()
  - Factor do_feature_fixup calls
  - Move 64-bit feature fixup earlier
  - Move 64-bit memory reserves to setup_arch()
  - Use a cachable DART
  - Move FW feature probing out of pseries probe()
  - Put exception configuration in a common place
  - Remove early allocation of the SMU command buffer
  - Move MMU backend selection out of platform code
  - pasemi: Remove IOBMAP allocation from platform probe()
  - mm/hash: Don't use machine_is() early during boot
  - Don't test for machine type to detect HEA special case
  - pmac: Remove spurrious machine type test
  - Move hash table ops to a separate structure
  - Ensure that ppc_md is empty before probing for machine type
  - Move 64-bit probe_machine() to later in the boot process
  - Move 32-bit probe() machine to later in the boot process
  - Get rid of ppc_md.init_early()
  - Move the boot time info banner to a separate function
  - Move setting of {i,d}cache_bsize to initialize_cache_info()
  - Move the content of setup_system() to setup_arch()
  - Move cache info inits to a separate function
  - Re-order the call to smp_setup_cpu_maps()
  - Re-order setup_panic()
  - Make a few boot functions __init
  - Merge 32-bit and 64-bit setup_arch()
 
 Other new features:
  - tty/hvc: Use IRQF_SHARED for OPAL hvc consoles from Sam Mendoza-Jonas
  - tty/hvc: Use opal irqchip interface if available from Sam Mendoza-Jonas
  - powerpc: Add module autoloading based on CPU features from Alastair D'Silva
  - crypto: vmx - Convert to CPU feature based module autoloading from Alastair D'Silva
  - Wake up kopald polling thread before waiting for events from Benjamin Herrenschmidt
  - xmon: Dump ISA 2.06 SPRs from Michael Ellerman
  - xmon: Dump ISA 2.07 SPRs from Michael Ellerman
  - Add a parameter to disable 1TB segs from Oliver O'Halloran
  - powerpc/boot: Add OPAL console to epapr wrappers from Oliver O'Halloran
  - Assign fixed PHB number based on device-tree properties from Guilherme G. Piccoli
  - pseries: Add pseries hotplug workqueue from John Allen
  - pseries: Add support for hotplug interrupt source from John Allen
  - pseries: Use kernel hotplug queue for PowerVM hotplug events from John Allen
  - pseries: Move property cloning into its own routine from Nathan Fontenot
  - pseries: Dynamic add entires to associativity lookup array from Nathan Fontenot
  - pseries: Auto-online hotplugged memory from Nathan Fontenot
  - pseries: Remove call to memblock_add() from Nathan Fontenot
 
 cxl:
  - Add set and get private data to context struct from Michael Neuling
  - make base more explicitly non-modular from Paul Gortmaker
  - Use for_each_compatible_node() macro from Wei Yongjun
  - Frederic Barrat
    - Abstract the differences between the PSL and XSL
    - Make vPHB device node match adapter's
  - Philippe Bergheaud
    - Add mechanism for delivering AFU driver specific events
    - Ignore CAPI adapters misplaced in switched slots
    - Refine slice error debug messages
  - Andrew Donnellan
    - static-ify variables to fix sparse warnings
    - PCI/hotplug: pnv_php: export symbols and move struct types needed by cxl
    - PCI/hotplug: pnv_php: handle OPAL_PCI_SLOT_OFFLINE power state
    - Add cxl_check_and_switch_mode() API to switch bi-modal cards
    - remove dead Kconfig options
    - fix potential NULL dereference in free_adapter()
  - Ian Munsie
    - Update process element after allocating interrupts
    - Add support for CAPP DMA mode
    - Fix allowing bogus AFU descriptors with 0 maximum processes
    - Fix allocating a minimum of 2 pages for the SPA
    - Fix bug where AFU disable operation had no effect
    - Workaround XSL bug that does not clear the RA bit after a reset
    - Fix NULL pointer dereference on kernel contexts with no AFU interrupts
    - powerpc/powernv: Split cxl code out into a separate file
    - Add cxl_slot_is_supported API
    - Enable bus mastering for devices using CAPP DMA mode
    - Move cxl_afu_get / cxl_afu_put to base
    - Allow a default context to be associated with an external pci_dev
    - Do not create vPHB if there are no AFU configuration records
    - powerpc/powernv: Add support for the cxl kernel api on the real phb
    - Add support for using the kernel API with a real PHB
    - Add kernel APIs to get & set the max irqs per context
    - Add preliminary workaround for CX4 interrupt limitation
    - Add support for interrupts on the Mellanox CX4
    - Workaround PE=0 hardware limitation in Mellanox CX4
    - powerpc/powernv: Fix pci-cxl.c build when CONFIG_MODULES=n
 
 selftests:
  - Test unaligned copy and paste from Chris Smart
  - Load Monitor Register Tests from Jack Miller
  - Cyril Bur
    - exec() with suspended transaction
    - Use signed long to read perf_event_paranoid
    - Fix usage message in context_switch
    - Fix generation of vector instructions/types in context_switch
  - Michael Ellerman
    - Use "Delta" rather than "Error" in normal output
    - Import Anton's mmap & futex micro benchmarks
    - Add a test for PROT_SAO
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXnWchAAoJEFHr6jzI4aWAe64P/36Vd9yJLptjkoyZp8/IQtu1
 Cv8buQwGdKuSMzdkcUAOXcC3fe2u70ZWXMKKLfY3koIV1IAiqdWk5/XWRKMP2XmE
 dG0LhSf0uu7uh+mE0WvQnRu46ImeKtQ+mPp4Hbs/s9SxMSeYjruv3vdWWmgUq0cl
 Gac2qJSRtAMmgLuHWMjf7N5mxOTOnKejU4o2i9cJ+YHmWKOdCigv2Ge1UadOQFlC
 E7tRPiUR3asfDfj+e+LVTTdToH6p8pk+mOUzIoZ8jIkQ+IXzi62UDl5+Rw9mqiuX
 1CtqEMUXxo2qwX+d4TcV/QUOp0YKPuIcUZ9NMMS+S3lOyJ4NFt+j2Izk7QJp5kNP
 gKVqB68TjDQsBuDr3P9ynlHbduxTIhZAqopbTrLe0FIg48nUe4n1yHJBVzqaVajX
 rFBJSsSUffBLAARNPSXJJhIgc2C1/qOC8dgMeDMcR2kPirDHaQZ/lY1yEpq1yiqR
 q6e3v5hvIAm4IjbYk0mF7TUxBrPGVE/ExyBINyASRoYxAJ1PyeD/iljZ9vI3asRA
 s+hhxT8H3f7lnqTrmJqMjHgAdGkmag07EdmvFNX4xK4aADSy7Y6g4dw25ffRopo9
 p9Jf9HX+dZv65Y3UjbV/6HuXcaSEBJJLSVWvii65PebqSN0LuHEFvNeIJ6Iblx0B
 AWh/hd0Iin2gdkcG39Mr
 =Z5kM
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights:
   - PowerNV PCI hotplug support.
   - Lots more Power9 support.
   - eBPF JIT support on ppc64le.
   - Lots of cxl updates.
   - Boot code consolidation.

  Bug fixes:
   - Fix spin_unlock_wait() from Boqun Feng
   - Fix stack pointer corruption in __tm_recheckpoint() from Michael
     Neuling
   - Fix multiple bugs in memory_hotplug_max() from Bharata B Rao
   - mm: Ensure "special" zones are empty from Oliver O'Halloran
   - ftrace: Separate the heuristics for checking call sites from
     Michael Ellerman
   - modules: Never restore r2 for a mprofile-kernel style mcount() call
     from Michael Ellerman
   - Fix endianness when reading TCEs from Alexey Kardashevskiy
   - start rtasd before PCI probing from Greg Kurz
   - PCI: rpaphp: Fix slot registration for multiple slots under a PHB
     from Tyrel Datwyler
   - powerpc/mm: Add memory barrier in __hugepte_alloc() from Sukadev
     Bhattiprolu

  Cleanups & fixes:
   - Drop support for MPIC in pseries from Rashmica Gupta
   - Define and use PPC64_ELF_ABI_v2/v1 from Michael Ellerman
   - Remove unused symbols in asm-offsets.c from Rashmica Gupta
   - Fix SRIOV not building without EEH enabled from Russell Currey
   - Remove kretprobe_trampoline_holder from Thiago Jung Bauermann
   - Reduce log level of PCI I/O space warning from Benjamin
     Herrenschmidt
   - Add array bounds checking to crash_shutdown_handlers from Suraj
     Jitindar Singh
   - Avoid -maltivec when using clang integrated assembler from Anton
     Blanchard
   - Fix array overrun in ppc_rtas() syscall from Andrew Donnellan
   - Fix error return value in cmm_mem_going_offline() from Rasmus
     Villemoes
   - export cpu_to_core_id() from Mauricio Faria de Oliveira
   - Remove old symbols from defconfigs from Andrew Donnellan
   - Update obsolete comments in setup_32.c about entry conditions from
     Benjamin Herrenschmidt
   - Add comment explaining the purpose of setup_kdump_trampoline() from
     Benjamin Herrenschmidt
   - Merge the RELOCATABLE config entries for ppc32 and ppc64 from Kevin
     Hao
   - Remove RELOCATABLE_PPC32 from Kevin Hao
   - Fix .long's in tlb-radix.c to more meaningful from Balbir Singh

  Minor cleanups & fixes:
   - Andrew Donnellan, Anna-Maria Gleixner, Anton Blanchard, Benjamin
     Herrenschmidt, Bharata B Rao, Christophe Leroy, Colin Ian King,
     Geliang Tang, Greg Kurz, Madhavan Srinivasan, Michael Ellerman,
     Michael Ellerman, Stephen Rothwell, Stewart Smith.

  Freescale updates from Scott:
   - "Highlights include more 8xx optimizations, device tree updates,
     and MVME7100 support."

  PowerNV PCI hotplug from Gavin Shan:
   - PCI: Add pcibios_setup_bridge()
   - Override pcibios_setup_bridge()
   - Remove PCI_RESET_DELAY_US
   - Move pnv_pci_ioda_setup_opal_tce_kill() around
   - Increase PE# capacity
   - Allocate PE# in reverse order
   - Create PEs in pcibios_setup_bridge()
   - Setup PE for root bus
   - Extend PCI bridge resources
   - Make pnv_ioda_deconfigure_pe() visible
   - Dynamically release PE
   - Update bridge windows on PCI plug
   - Delay populating pdn
   - Support PCI slot ID
   - Use PCI slot reset infrastructure
   - Introduce pnv_pci_get_slot_id()
   - Functions to get/set PCI slot state
   - PCI/hotplug: PowerPC PowerNV PCI hotplug driver
   - Print correct PHB type names

  Power9 idle support from Shreyas B. Prabhu:
   - set power_save func after the idle states are initialized
   - Use PNV_THREAD_WINKLE macro while requesting for winkle
   - make hypervisor state restore a function
   - Rename idle_power7.S to idle_book3s.S
   - Rename reusable idle functions to hardware agnostic names
   - Make pnv_powersave_common more generic
   - abstraction for saving SPRs before entering deep idle states
   - Add platform support for stop instruction
   - cpuidle/powernv: Use CPUIDLE_STATE_MAX instead of MAX_POWERNV_IDLE_STATES
   - cpuidle/powernv: cleanup cpuidle-powernv.c
   - cpuidle/powernv: Add support for POWER ISA v3 idle states
   - Use deepest stop state when cpu is offlined

  Power9 PMU from Madhavan Srinivasan:
   - factor out power8 pmu macros and defines
   - factor out power8 pmu functions
   - factor out power8 __init_pmu code
   - Add power9 event list macros for generic and cache events
   - Power9 PMU support
   - Export Power9 generic and cache events to sysfs

  Power9 preliminary interrupt & PCI support from Benjamin Herrenschmidt:
   - Add XICS emulation APIs
   - Move a few exception common handlers to make room
   - Add support for HV virtualization interrupts
   - Add mechanism to force a replay of interrupts
   - Add ICP OPAL backend
   - Discover IODA3 PHBs
   - pci: Remove obsolete SW invalidate
   - opal: Add real mode call wrappers
   - Rename TCE invalidation calls
   - Remove SWINV constants and obsolete TCE code
   - Rework accessing the TCE invalidate register
   - Fallback to OPAL for TCE invalidations
   - Use the device-tree to get available range of M64's
   - Check status of a PHB before using it
   - pci: Don't try to allocate resources that will be reassigned

  Other Power9:
   - Send SIGBUS on unaligned copy and paste from Chris Smart
   - Large Decrementer support from Oliver O'Halloran
   - Load Monitor Register Support from Jack Miller

  Performance improvements from Anton Blanchard:
   - Avoid load hit store in __giveup_fpu() and __giveup_altivec()
   - Avoid load hit store in setup_sigcontext()
   - Remove assembly versions of strcpy, strcat, strlen and strcmp
   - Align hot loops of some string functions

  eBPF JIT from Naveen N. Rao:
   - Fix/enhance 32-bit Load Immediate implementation
   - Optimize 64-bit Immediate loads
   - Introduce rotate immediate instructions
   - A few cleanups
   - Isolate classic BPF JIT specifics into a separate header
   - Implement JIT compiler for extended BPF

  Operator Panel driver from Suraj Jitindar Singh:
   - devicetree/bindings: Add binding for operator panel on FSP machines
   - Add inline function to get rc from an ASYNC_COMP opal_msg
   - Add driver for operator panel on FSP machines

  Sparse fixes from Daniel Axtens:
   - make some things static
   - Introduce asm-prototypes.h
   - Include headers containing prototypes
   - Use #ifdef __BIG_ENDIAN__ #else for REG_BYTE
   - kvm: Clarify __user annotations
   - Pass endianness to sparse
   - Make ppc_md.{halt, restart} __noreturn

  MM fixes & cleanups from Aneesh Kumar K.V:
   - radix: Update LPCR HR bit as per ISA
   - use _raw variant of page table accessors
   - Compile out radix related functions if RADIX_MMU is disabled
   - Clear top 16 bits of va only on older cpus
   - Print formation regarding the the MMU mode
   - hash: Update SDR1 size encoding as documented in ISA 3.0
   - radix: Update PID switch sequence
   - radix: Update machine call back to support new HCALL.
   - radix: Add LPID based tlb flush helpers
   - radix: Add a kernel command line to disable radix
   - Cleanup LPCR defines

  Boot code consolidation from Benjamin Herrenschmidt:
   - Move epapr_paravirt_early_init() to early_init_devtree()
   - cell: Don't use flat device-tree after boot
   - ge_imp3a: Don't use the flat device-tree after boot
   - mpc85xx_ds: Don't use the flat device-tree after boot
   - mpc85xx_rdb: Don't use the flat device-tree after boot
   - Don't test for machine type in rtas_initialize()
   - Don't test for machine type in smp_setup_cpu_maps()
   - dt: Add of_device_compatible_match()
   - Factor do_feature_fixup calls
   - Move 64-bit feature fixup earlier
   - Move 64-bit memory reserves to setup_arch()
   - Use a cachable DART
   - Move FW feature probing out of pseries probe()
   - Put exception configuration in a common place
   - Remove early allocation of the SMU command buffer
   - Move MMU backend selection out of platform code
   - pasemi: Remove IOBMAP allocation from platform probe()
   - mm/hash: Don't use machine_is() early during boot
   - Don't test for machine type to detect HEA special case
   - pmac: Remove spurrious machine type test
   - Move hash table ops to a separate structure
   - Ensure that ppc_md is empty before probing for machine type
   - Move 64-bit probe_machine() to later in the boot process
   - Move 32-bit probe() machine to later in the boot process
   - Get rid of ppc_md.init_early()
   - Move the boot time info banner to a separate function
   - Move setting of {i,d}cache_bsize to initialize_cache_info()
   - Move the content of setup_system() to setup_arch()
   - Move cache info inits to a separate function
   - Re-order the call to smp_setup_cpu_maps()
   - Re-order setup_panic()
   - Make a few boot functions __init
   - Merge 32-bit and 64-bit setup_arch()

  Other new features:
   - tty/hvc: Use IRQF_SHARED for OPAL hvc consoles from Sam Mendoza-Jonas
   - tty/hvc: Use opal irqchip interface if available from Sam Mendoza-Jonas
   - powerpc: Add module autoloading based on CPU features from Alastair D'Silva
   - crypto: vmx - Convert to CPU feature based module autoloading from Alastair D'Silva
   - Wake up kopald polling thread before waiting for events from Benjamin Herrenschmidt
   - xmon: Dump ISA 2.06 SPRs from Michael Ellerman
   - xmon: Dump ISA 2.07 SPRs from Michael Ellerman
   - Add a parameter to disable 1TB segs from Oliver O'Halloran
   - powerpc/boot: Add OPAL console to epapr wrappers from Oliver O'Halloran
   - Assign fixed PHB number based on device-tree properties from Guilherme G. Piccoli
   - pseries: Add pseries hotplug workqueue from John Allen
   - pseries: Add support for hotplug interrupt source from John Allen
   - pseries: Use kernel hotplug queue for PowerVM hotplug events from John Allen
   - pseries: Move property cloning into its own routine from Nathan Fontenot
   - pseries: Dynamic add entires to associativity lookup array from Nathan Fontenot
   - pseries: Auto-online hotplugged memory from Nathan Fontenot
   - pseries: Remove call to memblock_add() from Nathan Fontenot

  cxl:
   - Add set and get private data to context struct from Michael Neuling
   - make base more explicitly non-modular from Paul Gortmaker
   - Use for_each_compatible_node() macro from Wei Yongjun
   - Frederic Barrat
   - Abstract the differences between the PSL and XSL
   - Make vPHB device node match adapter's
   - Philippe Bergheaud
   - Add mechanism for delivering AFU driver specific events
   - Ignore CAPI adapters misplaced in switched slots
   - Refine slice error debug messages
   - Andrew Donnellan
   - static-ify variables to fix sparse warnings
   - PCI/hotplug: pnv_php: export symbols and move struct types needed by cxl
   - PCI/hotplug: pnv_php: handle OPAL_PCI_SLOT_OFFLINE power state
   - Add cxl_check_and_switch_mode() API to switch bi-modal cards
   - remove dead Kconfig options
   - fix potential NULL dereference in free_adapter()
   - Ian Munsie
   - Update process element after allocating interrupts
   - Add support for CAPP DMA mode
   - Fix allowing bogus AFU descriptors with 0 maximum processes
   - Fix allocating a minimum of 2 pages for the SPA
   - Fix bug where AFU disable operation had no effect
   - Workaround XSL bug that does not clear the RA bit after a reset
   - Fix NULL pointer dereference on kernel contexts with no AFU interrupts
   - powerpc/powernv: Split cxl code out into a separate file
   - Add cxl_slot_is_supported API
   - Enable bus mastering for devices using CAPP DMA mode
   - Move cxl_afu_get / cxl_afu_put to base
   - Allow a default context to be associated with an external pci_dev
   - Do not create vPHB if there are no AFU configuration records
   - powerpc/powernv: Add support for the cxl kernel api on the real phb
   - Add support for using the kernel API with a real PHB
   - Add kernel APIs to get & set the max irqs per context
   - Add preliminary workaround for CX4 interrupt limitation
   - Add support for interrupts on the Mellanox CX4
   - Workaround PE=0 hardware limitation in Mellanox CX4
   - powerpc/powernv: Fix pci-cxl.c build when CONFIG_MODULES=n

  selftests:
   - Test unaligned copy and paste from Chris Smart
   - Load Monitor Register Tests from Jack Miller
   - Cyril Bur
   - exec() with suspended transaction
   - Use signed long to read perf_event_paranoid
   - Fix usage message in context_switch
   - Fix generation of vector instructions/types in context_switch
   - Michael Ellerman
   - Use "Delta" rather than "Error" in normal output
   - Import Anton's mmap & futex micro benchmarks
   - Add a test for PROT_SAO"

* tag 'powerpc-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (263 commits)
  powerpc/mm: Parenthesise IS_ENABLED() in if condition
  tty/hvc: Use opal irqchip interface if available
  tty/hvc: Use IRQF_SHARED for OPAL hvc consoles
  selftests/powerpc: exec() with suspended transaction
  powerpc: Improve comment explaining why we modify VRSAVE
  powerpc/mm: Drop unused externs for hpte_init_beat[_v3]()
  powerpc/mm: Rename hpte_init_lpar() and move the fallback to a header
  powerpc/mm: Fix build break when PPC_NATIVE=n
  crypto: vmx - Convert to CPU feature based module autoloading
  powerpc: Add module autoloading based on CPU features
  powerpc/powernv/ioda: Fix endianness when reading TCEs
  powerpc/mm: Add memory barrier in __hugepte_alloc()
  powerpc/modules: Never restore r2 for a mprofile-kernel style mcount() call
  powerpc/ftrace: Separate the heuristics for checking call sites
  powerpc: Merge 32-bit and 64-bit setup_arch()
  powerpc/64: Make a few boot functions __init
  powerpc: Re-order setup_panic()
  powerpc: Re-order the call to smp_setup_cpu_maps()
  powerpc/32: Move cache info inits to a separate function
  powerpc/64: Move the content of setup_system() to setup_arch()
  ...
2016-07-30 21:01:36 -07:00
Michael Ellerman
719dbb2df7 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next
Freescale updates from Scott:

"Highlights include more 8xx optimizations, device tree updates,
and MVME7100 support."
2016-07-30 13:43:19 +10:00
Linus Torvalds
a6408f6cb6 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
 "This is the next part of the hotplug rework.

   - Convert all notifiers with a priority assigned

   - Convert all CPU_STARTING/DYING notifiers

     The final removal of the STARTING/DYING infrastructure will happen
     when the merge window closes.

  Another 700 hundred line of unpenetrable maze gone :)"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  timers/core: Correct callback order during CPU hot plug
  leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
  powerpc/numa: Convert to hotplug state machine
  arm/perf: Fix hotplug state machine conversion
  irqchip/armada: Avoid unused function warnings
  ARC/time: Convert to hotplug state machine
  clocksource/atlas7: Convert to hotplug state machine
  clocksource/armada-370-xp: Convert to hotplug state machine
  clocksource/exynos_mct: Convert to hotplug state machine
  clocksource/arm_global_timer: Convert to hotplug state machine
  rcu: Convert rcutree to hotplug state machine
  KVM/arm/arm64/vgic-new: Convert to hotplug state machine
  smp/cfd: Convert core to hotplug state machine
  x86/x2apic: Convert to CPU hotplug state machine
  profile: Convert to hotplug state machine
  timers/core: Convert to hotplug state machine
  hrtimer: Convert to hotplug state machine
  x86/tboot: Convert to hotplug state machine
  arm64/armv8 deprecated: Convert to hotplug state machine
  hwtracing/coresight-etm4x: Convert to hotplug state machine
  ...
2016-07-29 13:55:30 -07:00
Stephen Rothwell
fbef66f0ad powerpc/mm: Parenthesise IS_ENABLED() in if condition
Currently IS_ENABLED() produces an expression surrounded by parentheses,
which allows this code to compile, generating eg:

    else if (1 || 0)
        hpte_init_native();

However a change to the macro in the kbuild tree will break this in
future by removing the parentheses.

Fixes: 7353644fa9 ("powerpc/mm: Fix build break when PPC_NATIVE=n")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-28 12:39:15 +10:00
Kirill A. Shutemov
dcddffd41d mm: do not pass mm_struct into handle_mm_fault
We always have vma->vm_mm around.

Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Michael Ellerman
6364e84e85 powerpc/mm: Rename hpte_init_lpar() and move the fallback to a header
hpte_init_lpar() is part of the pseries platform, so name it as such.

Move the fallback implementation for when PSERIES=n into the header,
dropping the weak implementation. The panic() is now handled by the
calling code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-26 14:16:18 +10:00
Michael Ellerman
7353644fa9 powerpc/mm: Fix build break when PPC_NATIVE=n
The recent commit to rework the hash MMU setup broke the build when
CONFIG_PPC_NATIVE=n. Fix it by adding an IS_ENABLED() check before
calling hpte_init_native().

Removing the else clause opens the possibility that we don't set any
ops, which would probably lead to a strange crash later. So add a check
that we correctly initialised at least one member of the struct.

Fixes: 166dd7d3fb ("powerpc/64: Move MMU backend selection out of platform code")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-26 14:16:08 +10:00
Sebastian Andrzej Siewior
bdab88e006 powerpc/numa: Convert to hotplug state machine
Install the callbacks via the state machine. On the boot cpu the callback is
invoked manually because cpuhp is not up yet and everything must be
preinitialized before additional CPUs are up.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Cc: Anton Blanchard <anton@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160718140727.GA13132@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-22 21:53:17 +02:00
Sukadev Bhattiprolu
0eab46be21 powerpc/mm: Add memory barrier in __hugepte_alloc()
__hugepte_alloc() uses kmem_cache_zalloc() to allocate a zeroed PTE
and proceeds to use the newly allocated PTE. Add a memory barrier to
make sure that the other CPUs see a properly initialized PTE.

Based on a fix suggested by James Dykman.

Reported-by: James Dykman <jdykman@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Tested-by: James Dykman <jdykman@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21 20:13:28 +10:00
Benjamin Herrenschmidt
7025776ed1 powerpc/mm: Move hash table ops to a separate structure
Moving probe_machine() to after mmu init will cause the ppc_md
fields relative to the hash table management to be overwritten.

Since we have essentially disconnected the machine type from
the hash backend ops, finish the job by moving them to a different
structure.

The only callback that didn't quite fix is update_partition_table
since this is not specific to hash, so I moved it to a standalone
variable for now. We can revisit later if needed.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fix ppc64e build failure in kexec]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21 18:59:09 +10:00
Benjamin Herrenschmidt
2b4e3ad8f5 powerpc/mm/hash64: Don't test for machine type to detect HEA special case
Instead, check for FW_FEATURE_SPLPAR. This should be roughtly equivalent
as all pseries machiens that can have an HEA also support SPLPAR and
no other machine type does.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21 18:58:25 +10:00
Benjamin Herrenschmidt
5556ecf5e9 powerpc/mm/hash: Don't use machine_is() early during boot
Use the device-tree instead as we'll be moving probe_machine()
out of early_setup

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21 18:58:21 +10:00
Benjamin Herrenschmidt
166dd7d3fb powerpc/64: Move MMU backend selection out of platform code
We move it into early_mmu_init() based on firmware features. For PS3,
we have to move the setting of these into early_init_devtree().

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21 18:56:38 +10:00
Benjamin Herrenschmidt
c40785ad30 powerpc/dart: Use a cachable DART
Instead of punching a hole in the linear mapping, just use normal
cachable memory, and apply the flush sequence documented in the
CPC625 (aka U3) user manual.

This allows us to remove quite a bit of code related to the early
allocation of the DART and the hole in the linear mapping. We can
also get rid of the copy of the DART for suspend/resume as the
original memory can just be saved/restored now, as long as we
properly sync the caches.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Integrate dart_init() fix to return ENODEV when DART disabled]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21 18:55:54 +10:00
Kevin Hao
27d1149667 powerpc/32: Remove RELOCATABLE_PPC32
It is seldom used in the kernel code and can be easily replaced by
either RELOCATABLE or PPC32. So there is no reason to keep a separate
kernel option for this.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-19 20:17:07 +10:00
Aneesh Kumar K.V
912cc87a65 powerpc/mm/radix: Add LPID based tlb flush helpers
We add a tlb flush variant, to flush LPID mappings.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:55 +10:00
Aneesh Kumar K.V
83209bc861 powerpc/mm/radix: Update machine call back to support new HCALL.
This update the machine dep callback such that we can use the same
callback to register process table. The interface is updated such that
we can easily call H_REGISTER_PROC_TBL hcall. The HCALL itself is
introduced in a later patch.

No functionality change introduced by this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:54 +10:00
Aneesh Kumar K.V
09cf5bcb0c powerpc/mm/radix: Update PID switch sequence
Update the PID switch as per ISA doc. slbia is needed in radix to
invalidate any implementation specific lookaside information.
We use the .long format due to build errors with the below compiler
version.

gcc (Ubuntu 5.3.1-14ubuntu2.1) 5.3.1 20160413
GNU assembler (GNU Binutils for Ubuntu) 2.26

CC      arch/powerpc/mm//mmu_context_book3s64.o
{standard input}: Assembler messages:
{standard input}:506: Error: junk at end of line: `0x7'
scripts/Makefile.build:291: recipe for target 'arch/powerpc/mm//mmu_context_book3s64.o' failed
make[1]: *** [arch/powerpc/mm//mmu_context_book3s64.o] Error 1
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:53 +10:00
Aneesh Kumar K.V
4b7a350480 powerpc/mm/hash: Update SDR1 size encoding as documented in ISA 3.0
ISA 3.0 document hash table size in bytes = 2^(HTABSIZE + 18)

No functionality change by this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:53 +10:00
Aneesh Kumar K.V
56547411a0 powerpc/mm: Print formation regarding the the MMU mode
This helps in easily identifying the MMU mode with which the kernel
is operating.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:52 +10:00
Aneesh Kumar K.V
accfad7d0a powerpc/mm: Clear top 16 bits of va only on older cpus
As per ISA, we need to do this only for architecture version 2.02 and
earlier. This continued to work even for 2.07. But let's not do this for
anything after 2.02. ISA 3.0 requires these top bits to be not cleared.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:52 +10:00
Aneesh Kumar K.V
bf16cdf48a powerpc/mm/radix: Update LPCR HR bit as per ISA
PowerISA 3.0 requires the MMU mode (radix vs. hash) of the hypervisor
to be mirrored in the LPCR register, in addition to the partition table.
This is done to avoid fetching from the table when deciding, among other
things, how to perform transitions to HV mode on some interrupts.
So let's set it up appropriately

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:50 +10:00
Balbir Singh
8cd6d3c23e powerpc/mm: Fix .long's in tlb-radix.c to more meaningful
The .longs with the shifts are harder to read, use more meaningful names
for the opcodes. PPC_TLBIE_5 is introduced for the 5 opcode variation of
the instruction due to an existing op-code for the 2 opcode variant.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:50 +10:00
Michael Ellerman
b5f1bf48f2 powerpc fixes for 4.7 #5
- tm: Always reclaim in start_thread() for exec() class syscalls from Cyril Bur
  - tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 from Michael Neuling
  - eeh: Fix wrong argument passed to eeh_rmv_device() from Gavin Shan
  - Initialise pci_io_base as early as possible from Darren Stevens
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXeEmsAAoJEFHr6jzI4aWAwMMQAKs/u9rwB3gpOkNJSHajN1Dd
 kdqDufzLxLDwbWnMfqM1+bcO2EOjPhKbtpbzhG6oeiET8undRRoLsjHS5rZeYK5h
 cviRPEJ/Yz8ZWaIgFGI8+02gXwU0MJhuTY8NPexXmsh4FRdKYwEuCIJShl30lg22
 P7UrJ2SCNM+H/uZyS07B7thiwBeAKSp6VkLTpuW/QDz2j1ra/F22dTh7c0Agdahd
 INAMAnh9nYeuMVYn4XjOOlQ07JnBTuf1/W5Wxlw4i/86rVq+Hy8zh5r1X52oysR5
 lZl63B9q3agKG9cc9lSN2ibTDVerlFMwB2QysX2a6Uy7+y2SB3hS7VS1RTXCh3hg
 /omApGGVW3Hh+E2CuKfFLQySU55NRpLAoTGravGr/KsH4wZP/n/fkrctldCrqm7P
 sTPT52+t+iJQk4fiskRY3yQ7DTTnt3rTC8MJRGqvLuCheolLll4NQaWOF75AJP+7
 WFWtC4QHOTPERMkhqLnZDG2vNuDg1H8chuZ2+PxtIs6G1vuOEun+MTZAYh4u6XWE
 bAIT9rV3xBdE17bzYOQz7lU1y7yNVtP7xkm0HIOAHlU4gUrjQp5u8F3TnPW3/M0m
 8GeaZdrPjhsaNg31YZODAeM8Ddf+N9d2a2VPIr/fzytURhMe0ss3Z/MdMoYRATab
 Lh1o+G3gDo9MVaphoJ3w
 =oEAY
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.7-5' into next

Pull in the fixes we sent during 4.7, we have code we want to merge into
next that depends on some of them.
2016-07-15 14:57:47 +10:00
Christophe Leroy
62f64b49d0 powerpc/8xx: add CONFIG_PIN_TLB_IMMR
CONFIG_PIN_TLB maps IMMR area and the first 24 Mbytes of memory.
In some circunstances it might be more interesting to not map
IMMR but map 32 Mbytes of memory instead.

Therefore we add config option CONFIG_PIN_TLB_IMMR to select if
IMMR shall be pinned or not, hence whether we pin 24 or 32 Mbytes of RAM

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-07-09 02:02:48 -05:00
Christophe Leroy
4ad274502f powerpc/8xx: Rework CONFIG_PIN_TLB handling
On recent kernels, with some debug options like for instance
CONFIG_LOCKDEP, the BSS requires more than 8M memory, allthough
the kernel code fits in the first 8M.
Today, it is necessary to activate CONFIG_PIN_TLB to get more than 8M
at startup, allthough pinning TLB is not necessary for that.

We could have inconditionaly mapped 16 or 24M bytes at startup
but some old hardware only have 8M and mapping non-existing RAM
would be an issue due to speculative accesses.

With the preceding patch however, the TLB entries are populated on
demand. By setting up the TLB miss handler to handle up to 24M until
the handler is patched for the entire memory space, it is possible
to allow access up to more memory without mapping non-existing RAM.

It is therefore not needed anymore to map memory data at all
at startup. It will be handled by the TLB miss handler.

One might still want to PIN the IMMR and the first 24M of RAM.
It is now possible to do it in the C memory initialisation
functions. In addition, we now know how much memory we have
when we do it, so we are able to adapt the pining to the
real amount of memory available. So boards with less than 24M
can now also benefit from PIN_TLB.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-07-09 02:02:48 -05:00
Christophe Leroy
bb7f380849 powerpc/8xx: Don't use page table for linear memory space
Instead of using the first level page table to define mappings for
the linear memory space, we can use direct mapping from the TLB
handling routines. This has several advantages:
* No need to read the tables at each TLB miss
* No issue in 16k pages mode where the 1st level table maps 64 Mbytes

The size of the available linear space is known at system startup.
In order to avoid data access at each TLB miss to know the memory
size, the TLB routine is patched at startup with the proper size

This patch provides a 10%-15% improvment of TLB miss handling for
kernel addresses

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-07-09 02:02:48 -05:00
Christophe Leroy
4badd43ae4 powerpc/8xx: Map IMMR area with 512k page at a fixed address
Once the linear memory space has been mapped with 8Mb pages, as
seen in the related commit, we get 11 millions DTLB missed during
the reference 600s period. 77% of the misses are on user addresses
and 23% are on kernel addresses (1 fourth for linear address space
and 3 fourth for virtual address space)

Traditionaly, each driver manages one computer board which has its
own components with its own memory maps.
But on embedded chips like the MPC8xx, the SOC has all registers
located in the same IO area.

When looking at ioremaps done during startup, we see that
many drivers are re-mapping small parts of the IMMR for their own use
and all those small pieces gets their own 4k page, amplifying the
number of TLB misses: in our system we get 0xff000000 mapped 31 times
and 0xff003000 mapped 9 times.

Even if each part of IMMR was mapped only once with 4k pages, it would
still be several small mappings towards linear area.

This patch maps the IMMR with a single 512k page.

With this patch applied, the number of DTLB misses during the 10 min
period is reduced to 11.8 millions for a duration of 5.8s, which
represents 2% of the non-idle time hence yet another 10% reduction.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-07-09 02:02:48 -05:00
Michael Ellerman
fc022fdf41 powerpc/kernel: Drop unused extern for current_set
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-07 22:03:10 +10:00
Benjamin Herrenschmidt
fecbfabe1d powerpc: Fix build with CONFIG_MEMORY_HOTPLUG on some configs
For memory hotplug to work, the MMU code needs to provide the functions
create_section_mapping() and remove_section_mapping() to respectively
map and unmap portions of the linear mapping.

At the moment only hash64 provides these, so we provide weak stubs that
just error out. This fixes the build with configurations such as 64-bit
BookE with CONFIG_MEMORY_HOTPLUG enabled.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-07 16:33:27 +10:00
Oliver O'Halloran
faf7882962 powerpc/mm: Add a parameter to disable 1TB segs
This patch adds the kernel command line parameter "no_tb_segs" which
forces the kernel to use 256MB rather than 1TB segments. Forcing the use
of 256MB segments makes it considerably easier to test code that depends
on an SLB miss occurring.

Suggested-by: Michael Neuling <mikey@neuling.org>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-05 23:58:54 +10:00
Linus Torvalds
70bd68d7f7 powerpc fixes for 4.7 #5
- tm: Always reclaim in start_thread() for exec() class syscalls from Cyril Bur
  - tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 from Michael Neuling
  - eeh: Fix wrong argument passed to eeh_rmv_device() from Gavin Shan
  - Initialise pci_io_base as early as possible from Darren Stevens
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXeEmsAAoJEFHr6jzI4aWAwMMQAKs/u9rwB3gpOkNJSHajN1Dd
 kdqDufzLxLDwbWnMfqM1+bcO2EOjPhKbtpbzhG6oeiET8undRRoLsjHS5rZeYK5h
 cviRPEJ/Yz8ZWaIgFGI8+02gXwU0MJhuTY8NPexXmsh4FRdKYwEuCIJShl30lg22
 P7UrJ2SCNM+H/uZyS07B7thiwBeAKSp6VkLTpuW/QDz2j1ra/F22dTh7c0Agdahd
 INAMAnh9nYeuMVYn4XjOOlQ07JnBTuf1/W5Wxlw4i/86rVq+Hy8zh5r1X52oysR5
 lZl63B9q3agKG9cc9lSN2ibTDVerlFMwB2QysX2a6Uy7+y2SB3hS7VS1RTXCh3hg
 /omApGGVW3Hh+E2CuKfFLQySU55NRpLAoTGravGr/KsH4wZP/n/fkrctldCrqm7P
 sTPT52+t+iJQk4fiskRY3yQ7DTTnt3rTC8MJRGqvLuCheolLll4NQaWOF75AJP+7
 WFWtC4QHOTPERMkhqLnZDG2vNuDg1H8chuZ2+PxtIs6G1vuOEun+MTZAYh4u6XWE
 bAIT9rV3xBdE17bzYOQz7lU1y7yNVtP7xkm0HIOAHlU4gUrjQp5u8F3TnPW3/M0m
 8GeaZdrPjhsaNg31YZODAeM8Ddf+N9d2a2VPIr/fzytURhMe0ss3Z/MdMoYRATab
 Lh1o+G3gDo9MVaphoJ3w
 =oEAY
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - tm: Always reclaim in start_thread() for exec() class syscalls from
   Cyril Bur

 - tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 from Michael
   Neuling

 - eeh: Fix wrong argument passed to eeh_rmv_device() from Gavin Shan

 - Initialise pci_io_base as early as possible from Darren Stevens

* tag 'powerpc-4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc: Initialise pci_io_base as early as possible
  powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
  powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()
  powerpc/tm: Always reclaim in start_thread() for exec() class syscalls
2016-07-02 17:47:54 -07:00
Darren Stevens
bfa37087aa powerpc: Initialise pci_io_base as early as possible
Commit d6a9996e84 ("powerpc/mm: vmalloc abstraction in preparation for
radix") turned kernel memory and IO addresses from #defined constants to
variables initialised at runtime.

On PA6T (pasemi) systems the setup_arch() machine call initialises the
onboard PCI-e root-ports, and uses pci_io_base to do this, which is now
before its value has been set, resulting in a panic early in boot before
console IO is initialised.

Move the pci_io_base initialisation to the same place as vmalloc ranges
are set (hash__early_init_mmu()/radix__early_init_mmu()) - this is the
earliest possible place we can initialise it.

Fixes: d6a9996e84 ("powerpc/mm: vmalloc abstraction in preparation for radix")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Darren Stevens <darren@stevens-zone.net>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Add #ifdef CONFIG_PCI, massage change log slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-30 16:52:29 +10:00
Linus Torvalds
2f6e97477b powerpc fixes for 4.7 #4
- mm/radix: Update to tlb functions ric argument from Aneesh Kumar K.V
  - mm/radix: Flush page walk cache when freeing page table from Aneesh Kumar K.V
  - mm/hash: Use the correct PPP mask when updating HPTE from Aneesh Kumar K.V
  - mm/hash: Don't add memory coherence if cache inhibited is set from Aneesh Kumar K.V
  - mm/radix: Update Radix tree size as per ISA 3.0 from Aneesh Kumar K.V
  - eeh: Fix invalid cached PE primary bus from Gavin Shan
  - Fix faults caused by radix patching of SLB miss handler from Michael Ellerman
  - bpf/jit: Disable classic BPF JIT on ppc64le from Naveen N. Rao
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXbmITAAoJEFHr6jzI4aWANXwQAIUzjKcLpyQWEKwOKfMqBT5T
 EfsWDqJA/J3mYKNcZyiB7qv1NVPPkU9DSBK0OaAKwYdg5YWKDBl6R3mW+j4di0bP
 SkFACCyE2WbLTCiz5fzd8l974RUh5jKQpIrObp4/8xp40d0vsyAzz4J7d4HVRsrr
 BnoTS/KmytsaDQls5kYArxhW6U+Shag586Au1hNt3SS/be8lCNEXLfa3ltCr7WLJ
 k+xM0KM5kpO9/OK40A64TH7xUZKQIgPMUR5Ct43IJhMeHNnQctLmGQQjRWTrajv1
 K/TfrYwCl66xzKaH5G3MKJgqJAJm1LTwGs+2aOn91x5hPrbmW+bLqr1Mm0ukjROz
 oaANO5fgEQjl0JRGCNAhLHvaoqJX6v5/7GbmFRoaigX4UKJ63nK1ABiwAgKDGnyj
 OchwwJywU5UIX/+9Qpig3CxQNhEV33Nnp8t+dsg8CPd9o/G0mIe0QP1eGdhD09mM
 X9eMfN08hLj5ERKvlpW0rrq1b/wizOGmUXbmt02HZi7iLNsyQMwShiOvwOaAvH6/
 SzEFBJdp11jNoe4GJDt5rH4HlnTnTAYwcLFMTDCCPdJXy7voI/J+MaAmG89S30dQ
 ph0+4v/8K2N0VDZ7kkgi0GL1gp9ULkgtimrN5Z0R8U7qEapEW6ybvv+0Ewln742f
 SCRNVMZgzcwe3CcCKzdn
 =fxml
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "mm/radix (Aneesh Kumar K.V):
   - Update to tlb functions ric argument
   - Flush page walk cache when freeing page table
   - Update Radix tree size as per ISA 3.0

  mm/hash (Aneesh Kumar K.V):
   - Use the correct PPP mask when updating HPTE
   - Don't add memory coherence if cache inhibited is set

  eeh (Gavin Shan):
   - Fix invalid cached PE primary bus

  bpf/jit (Naveen N. Rao):
   - Disable classic BPF JIT on ppc64le

  .. and fix faults caused by radix patching of SLB miss handler
  (Michael Ellerman)"

* tag 'powerpc-4.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/bpf/jit: Disable classic BPF JIT on ppc64le
  powerpc: Fix faults caused by radix patching of SLB miss handler
  powerpc/eeh: Fix invalid cached PE primary bus
  powerpc/mm/radix: Update Radix tree size as per ISA 3.0
  powerpc/mm/hash: Don't add memory coherence if cache inhibited is set
  powerpc/mm/hash: Use the correct PPP mask when updating HPTE
  powerpc/mm/radix: Flush page walk cache when freeing page table
  powerpc/mm/radix: Update to tlb functions ric argument
2016-06-25 06:01:48 -07:00
Michal Hocko
2379a23e34 powerpc: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.

{pud,pmd}_alloc_one are allocating from {PGT,PUD}_CACHE initialized in
pgtable_cache_init which doesn't have larger than sizeof(void *) << 12
size and that fits into !costly allocation request size.

PGALLOC_GFP is used only in radix__pgd_alloc which uses either order-0
or order-4 requests.  The first one doesn't need the flag while the
second does.  Drop __GFP_REPEAT from PGALLOC_GFP and add it for the
order-4 one.

This means that this flag has never been actually useful here because it
has always been used only for PAGE_ALLOC_COSTLY requests.

Link: http://lkml.kernel.org/r/1464599699-30131-12-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Michal Hocko
32d6bd9059 tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
This is the third version of the patchset previously sent [1].  I have
basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
rid of superfluous gfp flags" which went through dm tree.  I am sending
it now because it is tree wide and chances for conflicts are reduced
considerably when we want to target rc2.  I plan to send the next step
and rename the flag and move to a better semantic later during this
release cycle so we will have a new semantic ready for 4.8 merge window
hopefully.

Motivation:

While working on something unrelated I've checked the current usage of
__GFP_REPEAT in the tree.  It seems that a majority of the usage is and
always has been bogus because __GFP_REPEAT has always been about costly
high order allocations while we are using it for order-0 or very small
orders very often.  It seems that a big pile of them is just a
copy&paste when a code has been adopted from one arch to another.

I think it makes some sense to get rid of them because they are just
making the semantic more unclear.  Please note that GFP_REPEAT is
documented as

* __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt

* _might_ fail.  This depends upon the particular VM implementation.
  while !costly requests have basically nofail semantic.  So one could
  reasonably expect that order-0 request with __GFP_REPEAT will not loop
  for ever.  This is not implemented right now though.

I would like to move on with __GFP_REPEAT and define a better semantic
for it.

  $ git grep __GFP_REPEAT origin/master | wc -l
  111
  $ git grep __GFP_REPEAT | wc -l
  36

So we are down to the third after this patch series.  The remaining
places really seem to be relying on __GFP_REPEAT due to large allocation
requests.  This still needs some double checking which I will do later
after all the simple ones are sorted out.

I am touching a lot of arch specific code here and I hope I got it right
but as a matter of fact I even didn't compile test for some archs as I
do not have cross compiler for them.  Patches should be quite trivial to
review for stupid compile mistakes though.  The tricky parts are usually
hidden by macro definitions and thats where I would appreciate help from
arch maintainers.

[1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org

This patch (of 19):

__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.  Yet we
have the full kernel tree with its usage for apparently order-0
allocations.  This is really confusing because __GFP_REPEAT is
explicitly documented to allow allocation failures which is a weaker
semantic than the current order-0 has (basically nofail).

Let's simply drop __GFP_REPEAT from those places.  This would allow to
identify place which really need allocator to retry harder and formulate
a more specific semantic for what the flag is supposed to do actually.

Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Crispin <blogic@openwrt.org>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Aneesh Kumar K.V
b23d9c5b9c powerpc/mm/radix: Update Radix tree size as per ISA 3.0
ISA 3.0 updated it to be encoded as Radix tree size = 2^(RTS + 31). We
have it encoded as 2^(RTS + 28). Add a helper with the correct encoding
and use it instead of opencoding.

Fixes: 2bfd65e45e ("powerpc/mm/radix: Add radix callbacks for early init routines")
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-17 19:50:55 +10:00
Aneesh Kumar K.V
e568006b9d powerpc/mm/hash: Don't add memory coherence if cache inhibited is set
H_ENTER hcall handling in qemu had assumptions that a cache inhibited
hpte entry won't have memory conference set. Also older kernel
mentioned that some version of pHyp required this (the code removed
by the below commit says:

    /* Make pHyp happy */
    if ((rflags & _PAGE_NO_CACHE) && !(rflags & _PAGE_WRITETHRU))
            hpte_r &= ~HPTE_R_M;

But with older kernel we had some inconsistent memory conherence
mapping. We always enabled memory conherence in the page fault path and
removed memory conherence is _PAGE_NO_CACHE was set when we mapped the
page via htab_bolt_mapping. The commit mentioned below tried to
consolidate that by always enabling memory conherence. But as mentioned
above that breaks Qemu H_ENTER handling.

This patch update this such that we enable memory conherence only if
cache inhibited is not set and bring fault handling, lpar and bolt
mapping in sync.

Fixes: commit 30bda41aba4e("powerpc/mm: Drop WIMG in favour of new constant")
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-17 19:47:51 +10:00
Oliver O'Halloran
3079abe555 powerpc/mm: Ensure "special" zones are empty
The mm zone mechanism was traditionally used by arch specific code to
partition memory into allocation zones. However there are several zones
that are managed by the mm subsystem rather than the architecture. Most
architectures set the max PFN of these special zones to zero, however on
powerpc we set them to ~0ul. This, in conjunction with a bug in
free_area_init_nodes() results in all of system memory being placed in
ZONE_DEVICE when enabled. Device memory cannot be used for regular kernel
memory allocations so this will cause a kernel panic at boot. Given the
planned addition of more mm managed zones (ZONE_CMA) we should aim to be
consistent with every other architecture and set the max PFN for these
zones to zero.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-16 16:03:21 +10:00
Bharata B Rao
45b64ee649 powerpc/numa: Fix multiple bugs in memory_hotplug_max()
memory_hotplug_max() uses hot_add_drconf_memory_max() to get maxmimum
addressable memory by referring to ibm,dyanamic-memory property. There
are three problems with the current approach:

1 hot_add_drconf_memory_max() assumes that ibm,dynamic-memory includes
  all the LMBs of the guest, but that is not true for PowerKVM which
  populates only DR LMBs (LMBs that can be hotplugged/removed) in that
  property.
2 hot_add_drconf_memory_max() multiplies lmb-size with lmb-count to arrive
  at the max possible address. Since ibm,dynamic-memory doesn't include
  RMA LMBs, the address thus obtained will be less than the actual max
  address. For example, if max possible memory size is 32G, with lmb-size
  of 256MB there can be 127 LMBs in ibm,dynamic-memory (1 LMB for RMA
  which won't be present here).  hot_add_drconf_memory_max() would then
  return the max addressable memory as 127 * 256MB = 31.75GB, the max
  address should have been 32G which is what ibm,lrdr-capacity shows.
3 In PowerKVM, there can be a gap between the end of boot time RAM and
  beginning of hotplug RAM area. So just multiplying lmb-count with
  lmb-size will not provide the correct max possible address for PowerKVM.

This patch fixes 1 by using ibm,lrdr-capacity property to return the max
addressable memory whenever the property is present. Then it fixes 2 & 3
by fetching the address of the last LMB in ibm,dynamic-memory property.

Fixes: cd34206e94 ("powerpc: Add memory_hotplug_max()")
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-14 16:06:13 +10:00
Bharata B Rao
e70bd3ae91 powerpc/numa: Fix whitespace in hot_add_drconf_memory_max()
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-14 16:06:05 +10:00
Michael Ellerman
027dfac694 powerpc: Various typo fixes
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-14 13:58:26 +10:00
Aneesh Kumar K.V
8550e2fa34 powerpc/mm/hash: Use the correct PPP mask when updating HPTE
With commit e58e87adc8 "powerpc/mm: Update _PAGE_KERNEL_RO" we now
use all the three PPP bits. The top bit is now used to have a PPP value
of 0b110 which will be mapped to kernel read only. When updating the
hpte entry use right mask such that we update the 63rd bit (top 'P' bit)
too.

Prior to e58e87adc8 we didn't support KERNEL_RO at all (it was ==
KERNEL_RW), so this isn't a regression as such.

Fixes: e58e87adc8 ("powerpc/mm: Update _PAGE_KERNEL_RO")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-14 13:54:51 +10:00
Aneesh Kumar K.V
a145abf12c powerpc/mm/radix: Flush page walk cache when freeing page table
Even though a tlb_flush() does a flush with invalidate all cache,
we can end up doing an RCU page table free before calling tlb_flush().
That means we can have page walk cache entries even after we free the
page table pages. This can result in us doing wrong page table walk.

Avoid this by doing pwc flush on every page table free. We can't batch
the pwc flush, because the rcu call back function where we free the
page table pages doesn't have information of the mmu gather. Thus we
have to do a pwc on every page table page freed.

Note: I also removed the dummy tlb_flush_pgtable call functions for
hash 32.

Fixes: 1a472c9dba ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-10 16:14:52 +10:00
Aneesh Kumar K.V
36194812a4 powerpc/mm/radix: Update to tlb functions ric argument
Radix invalidate control (RIC) is used to control which cache to flush
using tlb instructions. When doing a PID flush, we currently flush
everything including page walk cache. For address range flush, we flush
only the TLB. In the next patch, we add support for flushing only the
page walk cache.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-10 16:14:43 +10:00
Aneesh Kumar K.V
3b6d1eb7ea powerpc/mm/hash: Compute the segment size correctly for ISA 3.0
PowerISA 3.0 encodes the segment size in the second half of hash page
table entry. Update hpte_decode() accordingly.

Fixes: 50de596de8 ("powerpc/mm/hash: Add support for Power9 Hash")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-08 14:36:22 +10:00
Aneesh Kumar K.V
9690c15742 powerpc/mm/radix: Fix always false comparison against MMU_NO_CONTEXT
In some of the radix TLB flush routines, we use a local to store the
mm->context.id, AKA the PID.

Currently we use an int, but the PID is unsigned long, so large values
of PID will be truncated. In particular MMU_NO_CONTEXT is -1, which
means all our comparisons against that value can never be true.

This means we'll issue TLB flushes when we shouldn't on radix enabled
machines.

Fix it by using an unsigned long for the local. Discovered by Coverity.

Fixes: 1a472c9dba ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Write change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-08 13:56:53 +10:00
Aneesh Kumar K.V
157d4d0620 powerpc/mm/radix: Add missing tlb flush
This should not have any impact on hash, because hash does tlb
invalidate with every pte update and we don't implement
flush_tlb_* functions for hash. With radix we should make an explicit
call to flush tlb outside pte update.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-01 13:47:34 +10:00
Aneesh Kumar K.V
dc47c0c1f8 powerpc/mm/hash: Fix the reference bit update when handling hash fault
When we converted the asm routines to C functions, we missed updating
HPTE_R_R based on _PAGE_ACCESSED. ASM code used to copy over the lower
bits from pte via.

andi.	r3,r30,0x1fe		/* Get basic set of flags */

We also update the code such that we won't update the Change bit ('C'
bit) always. This was added by commit c5cf0e30bf ("powerpc: Fix
buglet with MMU hash management").

With hash64, we need to make sure that hardware doesn't do a pte update
directly. This is because we do end up with entries in TLB with no hash
page table entry. This happens because when we find a hash bucket full,
we "evict" a more/less random entry from it. When we do that we don't
invalidate the TLB (hpte_remove) because we assume the old translation
is still technically "valid". For more info look at commit
0608d692463("powerpc/mm: Always invalidate tlb on hpte invalidate and
update").

Thus it's critical that valid hash PTEs always have reference bit set
and writeable ones have change bit set. We do this by hashing a
non-dirty linux PTE as read-only and always setting _PAGE_ACCESSED (and
thus R) when hashing anything else in. Any attempt by Linux at clearing
those bits also removes the corresponding hash entry.

Commit 5cf0e30bf3d8 did that for 'C' bit by enabling 'C' bit always.
We don't really need to do that because we never map a RW pte entry
without setting 'C' bit. On READ fault on a RW pte entry, we still map
it READ only, hence a store update in the page will still cause a hash
pte fault.

This patch reverts the part of commit c5cf0e30bf ("[PATCH] powerpc:
Fix buglet with MMU hash management") and retain the updatepp part.

- If we hit the updatepp path on native, the old code without that
  commit, would fail to set C bcause native_hpte_updatepp()
  was implemented to filter the same bits as H_PROTECT and not let C
  through thus we would "upgrade" a RO HPTE to RW without setting C
  thus causing the bug. So the real fix in that commit was the change
  to native_hpte_updatepp

Fixes: 89ff725051 ("powerpc/mm: Convert __hash_page_64K to C")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-01 13:47:34 +10:00
Aneesh Kumar K.V
d6c886006c powerpc/mm/radix: Update LPCR only if it is powernv
LPCR cannot be updated when running in guest mode.

Fixes: 2bfd65e45e ("powerpc/mm/radix: Add radix callbacks for early init routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-01 13:47:34 +10:00
Linus Torvalds
c04a588029 powerpc updates for 4.7
Highlights:
  - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
  - Live patching support for ppc64le (also merged via livepatching.git)
 
 Various cleanups & minor fixes from:
  - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
    Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
    Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
    Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
    Rothberg, Vipin K Parashar.
 
 General:
  - Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
  - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
  - Add support for userspace Power9 copy/paste from Chris Smart
  - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
  - Add mask of possible MMU features from Michael Ellerman
 
 PCI:
  - Enable pass through of NVLink to guests from Alexey Kardashevskiy
  - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
  - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
  - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
  - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
  - Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
 
 selftests:
  - Test cp_abort during context switch from Chris Smart
  - Add several tests for transactional memory support from Rashmica Gupta
 
 perf:
  - Add support for sampling interrupt register state from Anju T
  - Add support for unwinding perf-stackdump from Chandan Kumar
 
 cxl:
  - Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
  - Allow initialization on timebase sync failures from Frederic Barrat
  - Increase timeout for detection of AFU mmio hang from Frederic Barrat
  - Handle num_of_processes larger than can fit in the SPA from Ian Munsie
  - Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
  - Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
  - Check periodically the coherent platform function's state from Christophe Lombard
 
 Freescale:
  - Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
    workaround, and a kconfig dependency fix."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
 lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
 tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
 /AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
 iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
 /ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
 DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
 YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
 B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
 zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
 19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
 udQES+t3F/9gvtxgxtDe
 =YvyQ
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights:
   - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
   - Live patching support for ppc64le (also merged via livepatching.git)

  Various cleanups & minor fixes from:
   - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
     Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
     Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
     Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
     Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
     Bauermann, Valentin Rothberg, Vipin K Parashar.

  General:
   - Update LMB associativity index during DLPAR add/remove from Nathan
     Fontenot
   - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
   - Add support for userspace Power9 copy/paste from Chris Smart
   - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
   - Add mask of possible MMU features from Michael Ellerman

  PCI:
   - Enable pass through of NVLink to guests from Alexey Kardashevskiy
   - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
   - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
   - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
   - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
     from Guilherme G Piccoli
   - Remove the dependency on EEH struct in DDW mechanism from Guilherme
     G Piccoli

  selftests:
   - Test cp_abort during context switch from Chris Smart
   - Add several tests for transactional memory support from Rashmica
     Gupta

  perf:
   - Add support for sampling interrupt register state from Anju T
   - Add support for unwinding perf-stackdump from Chandan Kumar

  cxl:
   - Configure the PSL for two CAPI ports on POWER8NVL from Philippe
     Bergheaud
   - Allow initialization on timebase sync failures from Frederic Barrat
   - Increase timeout for detection of AFU mmio hang from Frederic
     Barrat
   - Handle num_of_processes larger than can fit in the SPA from Ian
     Munsie
   - Ensure PSL interrupt is configured for contexts with no AFU IRQs
     from Ian Munsie
   - Add kernel API to allow a context to operate with relocate disabled
     from Ian Munsie
   - Check periodically the coherent platform function's state from
     Christophe Lombard

  Freescale:
   - Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
     an erratum workaround, and a kconfig dependency fix."

* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
  powerpc/86xx: Fix PCI interrupt map definition
  powerpc/86xx: Move pci1 definition to the include file
  powerpc/fsl: Fix build of the dtb embedded kernel images
  powerpc/fsl: Fix rcpm compatible string
  powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
  powerpc/fsl-pci: Add a workaround for PCI 5 errata
  powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
  powerpc/powernv/npu: Add PE to PHB's list
  powerpc/powernv: Fix insufficient memory allocation
  powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
  Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
  powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
  powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
  powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
  powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
  Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
  powerpc/powernv/npu: Enable NVLink pass through
  powerpc/powernv/npu: Rework TCE Kill handling
  powerpc/powernv/npu: Add set/unset window helpers
  powerpc/powernv/ioda2: Export debug helper pe_level_printk()
  ...
2016-05-20 10:12:41 -07:00
Vaishali Thakkar
71bf79cc3f powerpc: mm: use hugetlb_bad_size()
Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
hugepage size is found.

Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Gavin Shan
171cb719da powerpc/mm: Improve readability of update_mmu_cache()
The function is used to update the MMU with software PTE. It can
be called by data access exception handler (0x300) or instruction
access exception handler (0x400). If the function is called by
0x400 handler, the local variable @access is set to _PAGE_EXEC
to indicate the software PTE should have that flag set. When the
function is called by 0x300 handler, @access is set to zero.

This improves the readability of the function by replacing if
statements with switch. No logical changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:09 +10:00
Oliver O'Halloran
dd0b52c47a powerpc/mm: define TOP_ZONE as a constant
The zone that contains the top of memory will be either ZONE_NORMAL
or ZONE_HIGHMEM depending on the kernel config. There are two functions
that require this information and both of them use an #ifdef to set
a local variable (top_zone). This is a little silly so lets just make it
a constant.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:08 +10:00
Michael Ellerman
aac55d7573 powerpc/mm/hash64: Fix subpage protection with 4K HPTE config
With Linux page size of 64K and hardware only supporting 4K HPTE, if we
use subpage protection, we always fail for the subpage 0 as shown
below (using the selftest subpage_prot test):

  520175565:  (4520111850): Failed at 0x3fffad4b0000 (p=13,sp=0,w=0), want=fault, got=pass !
  4520890210: (4520826495): Failed at 0x3fffad5b0000 (p=29,sp=0,w=0), want=fault, got=pass !
  4521574251: (4521510536): Failed at 0x3fffad6b0000 (p=45,sp=0,w=0), want=fault, got=pass !
  4522258324: (4522194609): Failed at 0x3fffad7b0000 (p=61,sp=0,w=0), want=fault, got=pass !

This is because hash preload wrongly inserts the HPTE entry for subpage
0 without looking at the subpage protection information.

Fix it by teaching should_hash_preload() not to preload if we have
subpage protection configured for that range.

It appears this has been broken since it was introduced in 2008.

Fixes: fa28237cfc ("[POWERPC] Provide a way to protect 4k subpages when using 64k pages")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rework into should_hash_preload() to avoid build fails w/SLICES=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:05 +10:00
Michael Ellerman
8bbc9b7b00 powerpc/mm/hash64: Factor out hash preload psize check
Currently we have a check in hash_preload() against the psize, which is
only included when CONFIG_PPC_MM_SLICES is enabled. We want to expand
this check in a subsequent patch, so factor it out to allow that. As a
bonus it removes the #ifdef in the C code.

Unfortunately we can't put this in the existing CONFIG_PPC_MM_SLICES
block because it would require a forward declaration.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:04 +10:00
Aneesh Kumar K.V
62ccf5bf1f powerpc/mm/slice: Remove slice_mm_new_context()
The usage in mm mmu_context_nohash.c is bogus, because we set the
context.id value to MMU_NO_CONTEXT 4 lines previously in the same
function, meaning slice_mm_new_context() will always be true.

The book3s 64 usage was removed in the previous commit. So remove it as
unused.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:00 +10:00
Aneesh Kumar K.V
2d566537dd powerpc/mm/subpage: Initialise user psize correctly
As part of the radix support we switched Book3s64 to use a value of ~0
for MMU_NO_CONTEXT. That is because id 0 is special on radix.

However that broke the logic in init_new_context(). The code there needs
to differentiate between a newly allocated context and one inherited via
fork. Previously it worked because a newly allocated context has an id
of zero (because it was just memset() to zero), which used to match
MMU_NO_CONTEXT, and therefore slice_mm_new_context() did the right
thing.

Instead check against a context.id value of zero instead of using
slice_mm_new_context().

Without this patch we never call slice_set_user_psize(), and end up with
a slice psize value of zero and we always end up using 4K HPTE.

Fixes: 1a472c9dba ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:59 +10:00
Aneesh Kumar K.V
bde3eb6222 powerpc/mm/radix: Add radix THP callbacks
The deposited pgtable_t is a pte fragment hence we cannot use page->lru
for linking then together. We use the first two 64 bits for pte fragment
as list_head type to link all deposited fragments together. On withdraw
we properly zero then out.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:57 +10:00
Aneesh Kumar K.V
3df33f12be powerpc/mm/thp: Abstraction for THP functions
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:57 +10:00
Aneesh Kumar K.V
6a1ea36260 powerpc/mm: THP is only available on hash64 as of now
Only code movement in this patch. No functionality change.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:56 +10:00
Aneesh Kumar K.V
484837601d powerpc/mm: Add radix support for hugetlb
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:55 +10:00
Aneesh Kumar K.V
2f5f0dfd1e powerpc/mm: Fix vma_mmu_pagesize() for radix
Radix doesn't use the slice framework to find the page size. Hence use
vma to find the page size.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:54 +10:00
Aneesh Kumar K.V
5ed7ecd08a powerpc/mm: pte_frag abstraction
In this patch we make the number of pte fragments per level 4 page table
page a variable. Radix level 4 table size is 256 bytes and hence we can
have 256 fragments per level 4 page. We don't update the fragment count
in this patch. We need to do performance measurements to find the right
value for fragment count.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:54 +10:00
Aneesh Kumar K.V
a3dece6d69 powerpc/radix: Update MMU cache
With radix there is no MMU cache. Hence we don't need to do anything in
update_mmu_cache().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:53 +10:00
Aneesh Kumar K.V
d6a9996e84 powerpc/mm: vmalloc abstraction in preparation for radix
The vmalloc range differs between hash and radix config. Hence make
VMALLOC_START and related constants a variable which will be runtime
initialized depending on whether hash or radix mode is active.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Fix missing init of ioremap_bot in pgtable_64.c for ppc64e]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:53 +10:00
Aneesh Kumar K.V
4dfb88ca9b powerpc/mm: Update pte filter for radix
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:52 +10:00
Aneesh Kumar K.V
a2f41eb992 powerpc/mm: Add radix pgalloc details
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:51 +10:00
Aneesh Kumar K.V
934828edfa powerpc/mm: Make 4K and 64K use pte_t for pgtable_t
This patch switches 4K Linux page size config to use pte_t * type
instead of struct page * for pgtable_t. This simplifies the code a lot
and helps in consolidating both 64K and 4K page allocator routines. The
changes should not have any impact, because we already store physical
address in the upper level page table tree and that implies we already
do struct page * to physical address conversion.

One change to note here is we move the pgtable_page_dtor() call for
nohash to pte_fragment_free_mm(). The nohash related change is due to
the related changes in pgtable_64.c.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:51 +10:00
Aneesh Kumar K.V
74701d5947 powerpc/mm: Rename function to indicate we are allocating fragments
Only code cleanup. No functionality change.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:50 +10:00
Aneesh Kumar K.V
b5dcc60969 powerpc/mm/radix: Update PTCR on secondary CPUs
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:48 +10:00
Aneesh Kumar K.V
7a0eedeedd powerpc/mm/radix: Pick the address layout for radix config
Hash needs special get_unmapped_area() handling because of limitations
around base page size, so we have to set HAVE_ARCH_UNMAPPED_AREA.

With radix we don't have such restrictions, so we could use the generic
code. But because we've set HAVE_ARCH_UNMAPPED_AREA (for hash), we have
to re-implement the same logic as the generic code.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:47 +10:00
Aneesh Kumar K.V
177ba7c647 powerpc/mm/radix: Limit paca allocation in radix
On return from RTAS we access the paca variables and we have 64 bit
disabled. This requires us to limit paca in 32 bit range.

Fix this by setting ppc64_rma_size to first_memblock_size/1G range.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:47 +10:00
Aneesh Kumar K.V
764041e0f4 powerpc/mm/radix: Add checks in slice code to catch radix usage
Radix doesn't need slice support. Catch incorrect usage of slice code
when radix is enabled.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:53:46 +10:00
Aneesh Kumar K.V
1a472c9dba powerpc/mm/radix: Add tlbflush routines
Core kernel doesn't track the page size of the VA range that we are
invalidating. Hence we end up flushing TLB for the entire mm here. Later
patches will improve this.

We also don't flush page walk cache separetly instead use RIC=2 when
flushing TLB, because we do a MMU gather flush after freeing page table.

MMU_NO_CONTEXT is updated for hash.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:33:09 +10:00
Aneesh Kumar K.V
676012a66f powerpc/mm: Hash abstraction for tlbflush routines
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:33:08 +10:00
Aneesh Kumar K.V
f5df4b4be9 powerpc/mm: Rename mmu_context_hash64.c to mmu_context_book3s64.c
This file now contains both hash and radix specific code. Rename it to
indicate this better.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:33:07 +10:00
Aneesh Kumar K.V
7e381c0ff6 powerpc/mm/radix: Add mmu context handling callback for radix
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:33:05 +10:00
Aneesh Kumar K.V
d2adba3fd1 powerpc/mm: Abstraction for switch_mmu_context()
How we switch MMU context differs between hash and radix. For hash we
need to switch the SLB details and for radix we need to switch the PID
SPR.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:33:04 +10:00
Aneesh Kumar K.V
d9225ad923 powerpc/mm/radix: Add radix callbacks for vmemmap and map_kernel page()
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:33:03 +10:00
Aneesh Kumar K.V
31a14fae92 powerpc/mm: Abstraction for vmemmap and map_kernel_page()
For hash we create vmemmap mapping using bolted hash page table entries.
For radix we fill the radix page table. The next patch will add the
radix details for creating vmemmap mappings.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:33:02 +10:00
Aneesh Kumar K.V
2bfd65e45e powerpc/mm/radix: Add radix callbacks for early init routines
This adds routines for early setup for radix. We use device tree
property "ibm,processor-radix-AP-encodings" to find supported page
sizes. If we don't find the above we consider 64K and 4K as supported
page sizes.

We do map vmemap using 2M page size if we can. The linear mapping is
done such that we use required page size for that range. For example
memory of 3.5G is mapped such that we use 1G mapping till 3G range and
use 2M mapping for the rest.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:33:00 +10:00
Aneesh Kumar K.V
756d08d1ba powerpc/mm: Abstract early MMU init in preparation for radix
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:58 +10:00
Aneesh Kumar K.V
dd1842a2a4 powerpc/mm: Make page table size a variable
Radix and hash MMU models support different page table sizes. Make
the #defines a variable so that existing code can work with variable
sizes.

Slice related code is only used by hash, so use hash constants there. We
will replicate some of the boundary conditions with resepct to TASK_SIZE
using radix values too. Right now we do boundary condition check using
hash constants.

Swapper pgdir size is initialized in asm code. We select the max pgd
size to keep it simple. For now we select hash pgdir. When adding radix
we will switch that to radix pgdir which is 64K.

BUILD_BUG_ON check which is removed is already done in hugepage_init()
using MAYBE_BUILD_BUG_ON().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:48 +10:00
Aneesh Kumar K.V
945537df7a powerpc/mm/book3s: Rename hash specific PTE bits to carry H_ prefix
This helps to make following hash only pte bits easier.

We have kept _PAGE_CHG_MASK, _HPAGE_CHG_MASK and _PAGE_PROT_BITS as it
is in this patch eventhough they use hash specific bits. Using them in
radix as it is should be ok, because with radix we expect those bit
positions to be zero.

Only renames in this patch, no change in functionality.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:43 +10:00
Aneesh Kumar K.V
eee24b5aaf powerpc/mm: Move hash and no hash code to separate files
This patch reduces the number of #ifdefs in C code and will also help in
adding radix changes later. Only code movement in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Propagate copyrights and update GPL text]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:42 +10:00
Aneesh Kumar K.V
50de596de8 powerpc/mm/hash: Add support for Power9 Hash
PowerISA 3.0 adds a parition table indexed by LPID. Parition table
allows us to specify the MMU model that will be used for guest and host
translation.

This patch adds support with SLB based hash model (UPRT = 0). What is
required with this model is to support the new hash page table entry
format and also setup partition table such that we use hash table for
address translation.

We don't have segment table support yet.

In order to make sure we don't load KVM module on Power9 (since we don't
have kvm support yet) this patch also disables KVM on Power9.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:40 +10:00
Aneesh Kumar K.V
ff844b741e powerpc/mm: Use generic version of pmdp_clear_flush_young()
The radix variant is going to require a flush_pmd_tlb_range(). With
flush_pmd_tlb_range() added, pmdp_clear_flush_young() is the same as the
generic version. So drop the powerpc specific variant.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:35 +10:00
Aneesh Kumar K.V
30bda41aba powerpc/mm: Drop WIMG in favour of new constants
PowerISA 3.0 introduces two pte bits with the below meaning for radix:
  00 -> Normal Memory
  01 -> Strong Access Order (SAO)
  10 -> Non idempotent I/O (Cache inhibited and guarded)
  11 -> Tolerant I/O (Cache inhibited)

We drop the existing WIMG bits in the Linux page table in favour of the
above constants. We loose _PAGE_WRITETHRU with this conversion. We only
use writethru via pgprot_cached_wthru() which is used by
fbdev/controlfb.c which is Apple control display and also PPC32.

With respect to _PAGE_COHERENCE, we have been marking hpte always
coherent for some time now. htab_convert_pte_flags() always added
HPTE_R_M.

NOTE: KVM changes need closer review.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:33 +10:00
Aneesh Kumar K.V
72176dd0ad powerpc/mm: Use a helper for finding pte bits mapping I/O area
Use a helper instead of open coding with constants. A later patch will
drop the WIMG bits and use PowerISA 3.0 defines.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:32 +10:00
Aneesh Kumar K.V
e58e87adc8 powerpc/mm: Update _PAGE_KERNEL_RO
PS3 had used a PPP bit hack to implement a read only mapping in the
kernel area. Since we are bolting the ioremap area, it used the pte
flags _PAGE_PRESENT | _PAGE_USER to get a PPP value of 0x3 there by
resulting in a read only mapping. This means the area can be accessed by
user space, but kernel will never return such an address to user space.

But we can do better by implementing a read only kernel mapping using
PPP bits 0b110.

This also allows us to do read only kernel mapping for radix in later
patches.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:30 +10:00
Aneesh Kumar K.V
96270b1fc2 powerpc/mm: Remove RPN_SHIFT and RPN_SIZE
PTE_RPN_SHIFT is actually page size dependent. Even though PowerISA 3.0
expects only the lower 12 bits to be zero, we will always find the pages
to be PAGE_SHIFT aligned. In case of hash config, this also allows us to
use the additional 3 bits to track pte specific information. We need
to make sure we use these bits only for hash specific pte flags.

For both 4K and 64K config, pte now can hold 57 bits address.

Inorder to keep things simpler, drop PTE_RPN_SHIFT and PTE_RPN_SIZE and
specify the 57 bit detail explicitly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:29 +10:00
Aneesh Kumar K.V
ac29c64089 powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED
_PAGE_PRIVILEGED means the page can be accessed only by the kernel. This
is done to keep pte bits similar to PowerISA 3.0 Radix PTE format. User
pages are now marked by clearing _PAGE_PRIVILEGED bit.

Previously we allowed the kernel to have a privileged page in the lower
address range (USER_REGION). With this patch such access is denied.

We also prevent a kernel access to a non-privileged page in higher
address range (ie, REGION_ID != 0).

Both the above access scenarios should never happen.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:26 +10:00
Michael Ellerman
7e1e63c5e9 powerpc/mm: Convert pte_user() to static inline
In a subsequent patch we want to add a second definition of pte_user().
Before we do that, make the signature clear, ie. it takes a pte_t and
returns bool.

We move it up inside the existing #ifndef __ASSEMBLY__ block, but
otherwise it's a straight conversion.

Convert the call in settlbcam(), which passes an unsigned long, to pass
a pte_t.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:24 +10:00
Aneesh Kumar K.V
73a1441a9b powerpc/mm/subpage: Clear RWX bit to indicate no access
Subpage protection used to depend on the _PAGE_USER bit to implement no
access mode. This patch switches that to use _PAGE_RWX. We clear Read,
Write and Execute access from the pte instead of clearing _PAGE_USER
now. This was done so that we can switch to _PAGE_PRIVILEGED in a later
patch.

subpage_protection() returns pte bits that need to be cleared. Instead
of updating the interface to handle no-access in a separate way, it
appears simpler to clear RWX acecss to indicate no access.

We still don't insert hash ptes for no access implied by !_PAGE_RWX.
Hence we should not get PROT_FAULT with change.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:23 +10:00
Aneesh Kumar K.V
c7d54842de powerpc/mm: Use _PAGE_READ to indicate Read access
This splits the _PAGE_RW bit into _PAGE_READ and _PAGE_WRITE. It also
removes the dependency on _PAGE_USER for implying read only. Few things
to note here is that, we have read implied with write and execute
permission. Hence we should always find _PAGE_READ set on hash pte
fault.

We still can't switch PROT_NONE to !(_PAGE_RWX). Auto numa depends on
marking a prot none pte _PAGE_WRITE. (For more details look at
b191f9b106 "mm: numa: preserve PTE write permissions across a NUMA
hinting fault")

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:21 +10:00
Aneesh Kumar K.V
5dc1ef858c powerpc/mm: Use big endian Linux page tables for book3s 64
Traditionally Power server machines have used the Hashed Page Table MMU
mode. In this mode Linux manages its own tree of nested page tables,
aka. "the Linux page tables", which are not used by the hardware
directly, and software loads translations into the hash page table for
use by the hardware.

Power ISA 3.0 defines a new MMU mode, known as Radix Tree Translation,
where the hardware can directly operate on the Linux page tables.
However the hardware requires that the page tables be in big endian
format.

To accommodate this, switch the pgtable types to __be64 and add
appropriate endian conversions.

Because we will be supporting a single kernel binary that boots using
either radix or hash mode, we always store the Linux page tables big
endian, even in hash mode where they are not actually used by the
hardware.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Fix sparse errors, flesh out change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:18 +10:00
Michael Ellerman
3910a7f485 powerpc/mm: Add pte_xchg() helper
We have five locations in 64-bit hash MMU code that do a cmpxchg() of a
PTE. Currently doing it inline OK, but in a future patch we will be
converting the PTEs to __be64 in some configs. In that case we will need
casts at every cmpxchg() site in order to keep sparse happy.

So move the logic into a helper, this is a reasonably nice cleanup on
its own.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:16 +10:00
Aneesh Kumar K.V
4bece39b50 powerpc/mm: Drop PTE_ATOMIC_UPDATES from pmd_hugepage_update()
pmd_hugepage_update() is inside #ifdef CONFIG_TRANSPARENT_HUGEPAGE. THP
can only be enabled if PPC_BOOK3S_64=y && PPC_64K_PAGES=y, aka. hash64.

On hash64 we always define PTE_ATOMIC_UPDATES to 1, meaning the #ifdef
in pmd_hugepage_update() is unnecessary, so drop it.

That is also the only use of PTE_ATOMIC_UPDATES in any of the hash code,
meaning we no longer need to #define it at all in the hash headers.

Note it's still #defined and used in the nohash code.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:15 +10:00
Daniel Axtens
7f92bc5694 powerpc: sparse: Include headers for __weak symbols
Sometimes when sparse warns about undefined symbols, it isn't
because they should have 'static' added, it's because they're
overriding __weak symbols defined elsewhere, and the header has
been missed.

Fix a few of them by adding appropriate headers.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-04-12 21:05:19 +10:00
Michael Ellerman
1f4c66e805 powerpc/mm: Remove long disabled SLB code
We have a bunch of SLB related code in the tree which is there to handle
dynamic VSIDs - but currently it's all disabled at compile time. The
comments say "Keep that around for when we re-implement dynamic VSIDs".

But that was over 10 years ago (commit 3c726f8dee ("[PATCH] ppc64:
support 64k pages")). The chance that it would still work unchanged is
minimal, and in the meantime it's confusing to folks browsing/grepping
the code. If we ever want to re-instate it, it's in the git history.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
2016-04-11 20:30:40 +10:00
Sebastian Siewior
08a5bb2921 powerpc/mm: Fixup preempt underflow with huge pages
hugepd_free() used __get_cpu_var() once. Nothing ensured that the code
accessing the variable did not migrate from one CPU to another and soon
this was noticed by Tiejun Chen in 94b09d7554 ("powerpc/hugetlb:
Replace __get_cpu_var with get_cpu_var"). So we had it fixed.

Christoph Lameter was doing his __get_cpu_var() replaces and forgot
PowerPC. Then he noticed this and sent his fixed up batch again which
got applied as 69111bac42 ("powerpc: Replace __get_cpu_var uses").

The careful reader will noticed one little detail: get_cpu_var() got
replaced with this_cpu_ptr(). So now we have a put_cpu_var() which does
a preempt_enable() and nothing that does preempt_disable() so we
underflow the preempt counter.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-29 12:08:07 +11:00
Linus Torvalds
d5e2d00898 powerpc updates for 4.6
Highlights:
  - Restructure Linux PTE on Book3S/64 to Radix format from Paul Mackerras
  - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh Kumar K.V
  - Add POWER9 cputable entry from Michael Neuling
  - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
  - Add support for new ftrace ABI on ppc64le from Torsten Duwe
 
 Various cleanups & minor fixes from:
  - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy, Cyril
    Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell Currey,
    Sukadev Bhattiprolu, Suraj Jitindar Singh.
 
 General:
  - atomics: Allow architectures to define their own __atomic_op_* helpers from
    Boqun Feng
  - Implement atomic{, 64}_*_return_* variants and acquire/release/relaxed
    variants for (cmp)xchg from Boqun Feng
  - Add powernv_defconfig from Jeremy Kerr
  - Fix BUG_ON() reporting in real mode from Balbir Singh
  - Add xmon command to dump OPAL msglog from Andrew Donnellan
  - Add xmon command to dump process/task similar to ps(1) from Douglas Miller
  - Clean up memory hotplug failure paths from David Gibson
 
 pci/eeh:
  - Redesign SR-IOV on PowerNV to give absolute isolation between VFs from Wei
    Yang.
  - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
  - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
  - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
  - MAINTAINERS: Update EEH details and maintainership from Russell Currey
 
 cxl:
  - Support added to the CXL driver for running on both bare-metal and
    hypervisor systems, from Christophe Lombard and Frederic Barrat.
  - Ignore probes for virtual afu pci devices from Vaibhav Jain
 
 perf:
  - Export Power8 generic and cache events to sysfs from Sukadev Bhattiprolu
  - hv-24x7: Fix usage with chip events, display change in counter values,
    display domain indices in sysfs, eliminate domain suffix in event names,
    from Sukadev Bhattiprolu
 
 Freescale:
  - Updates from Scott: "Highlights include 8xx optimizations, 32-bit checksum
    optimizations, 86xx consolidation, e5500/e6500 cpu hotplug, more fman and
    other dt bits, and minor fixes/cleanup."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW69OrAAoJEFHr6jzI4aWAe5EQAJw/hE6WBQc6a7Tj70AnXOqR
 qk/m5pZjuTwQxfBteIvHR1pE5eXdlvtAjcD254LVkFkAbIn19W/h2k0VX/nlee7P
 n/VRHRifjtGmukqHrPYJJ7ua9mNlY7pxh3leGSixBFASnSWqMxNNNziNQtSTcuCs
 TjHiw6NkZ/kzeunA4bAfE4yHVUZjmL74oiS9JbLyaVHqoW4fqWLlh26AKo2yYMZI
 qPicBBG4HBi3FGvoexnKxlJNdcV4HO7LzDjJmCSfUKYCJi+Pw19T5qmhso0q0qVz
 vHg/A8HNeG4Hn83pNVmLeQSAIQRZ3DvTtcLgbjPo+TVwm/hzrRRBWipTeOVbkLW8
 2bcOXT4t7LWUq15EAJ1LYgYZGzcLrfRfUeOcuQ1TWd3+PcfY9pE7FmizsxAAfaVe
 E9j9mpz4XnIqBtWkFHneTIHkQ5OWptyKuZJEaYH0nut4VsP0k8NarkseafGqBPu7
 5eG83gbiQbCVixfOgblV9eocJ29JcwpjPAY4CZSGJimShg909FV7WRgZgJkKWrbK
 dBRco8Jcp4VglGfo2qymv7Uj4KwQoypBREOhiKUvrAsVlDxPfx+bcskhjGu9xGDC
 xs/+nme0/lKa/wg5K4C3mQ1GAlkMWHI0ojhJjsyODbetup5UbkEu03wjAaTdO9dT
 Y6ptGm0rYAJluPNlziFj
 =qkAt
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "This was delayed a day or two by some build-breakage on old toolchains
  which we've now fixed.

  There's two PCI commits both acked by Bjorn.

  There's one commit to mm/hugepage.c which is (co)authored by Kirill.

  Highlights:
   - Restructure Linux PTE on Book3S/64 to Radix format from Paul
     Mackerras
   - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh
     Kumar K.V
   - Add POWER9 cputable entry from Michael Neuling
   - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
   - Add support for new ftrace ABI on ppc64le from Torsten Duwe

  Various cleanups & minor fixes from:
   - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy,
     Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell
     Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh.

  General:
   - atomics: Allow architectures to define their own __atomic_op_*
     helpers from Boqun Feng
   - Implement atomic{, 64}_*_return_* variants and acquire/release/
     relaxed variants for (cmp)xchg from Boqun Feng
   - Add powernv_defconfig from Jeremy Kerr
   - Fix BUG_ON() reporting in real mode from Balbir Singh
   - Add xmon command to dump OPAL msglog from Andrew Donnellan
   - Add xmon command to dump process/task similar to ps(1) from Douglas
     Miller
   - Clean up memory hotplug failure paths from David Gibson

  pci/eeh:
   - Redesign SR-IOV on PowerNV to give absolute isolation between VFs
     from Wei Yang.
   - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
   - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
   - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
   - MAINTAINERS: Update EEH details and maintainership from Russell
     Currey

  cxl:
   - Support added to the CXL driver for running on both bare-metal and
     hypervisor systems, from Christophe Lombard and Frederic Barrat.
   - Ignore probes for virtual afu pci devices from Vaibhav Jain

  perf:
   - Export Power8 generic and cache events to sysfs from Sukadev
     Bhattiprolu
   - hv-24x7: Fix usage with chip events, display change in counter
     values, display domain indices in sysfs, eliminate domain suffix in
     event names, from Sukadev Bhattiprolu

  Freescale:
   - Updates from Scott: "Highlights include 8xx optimizations, 32-bit
     checksum optimizations, 86xx consolidation, e5500/e6500 cpu
     hotplug, more fman and other dt bits, and minor fixes/cleanup"

* tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
  powerpc: Fix unrecoverable SLB miss during restore_math()
  powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers
  powerpc/rcpm: Fix build break when SMP=n
  powerpc/book3e-64: Use hardcoded mttmr opcode
  powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible
  powerpc/T104xRDB: add tdm riser card node to device tree
  powerpc32: PAGE_EXEC required for inittext
  powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree
  powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
  powerpc/86xx: Introduce and use common dtsi
  powerpc/86xx: Update device tree
  powerpc/86xx: Move dts files to fsl directory
  powerpc/86xx: Switch to kconfig fragments approach
  powerpc/86xx: Update defconfigs
  powerpc/86xx: Consolidate common platform code
  powerpc32: Remove one insn in mulhdu
  powerpc32: small optimisation in flush_icache_range()
  powerpc: Simplify test in __dma_sync()
  powerpc32: move xxxxx_dcache_range() functions inline
  powerpc32: Remove clear_pages() and define clear_page() inline
  ...
2016-03-19 15:38:41 -07:00
Joonsoo Kim
fe896d1878 mm: introduce page reference manipulation functions
The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count.  Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it.  Then, it is hard to find actual
reason of CMA allocation failure.  CMA allocation should be guaranteed
to succeed so finding offending place is really important.

In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function.  This is preparation step to
add tracepoint to each page reference manipulation function.  With this
facility, we can easily find reason of CMA allocation failure.  There is
no functional change in this patch.

In addition, this patch also converts reference read sites.  It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joonsoo Kim
e7df0d88c4 powerpc: query dynamic DEBUG_PAGEALLOC setting
We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
whether it is enabled or not in runtime.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Linus Torvalds
10dc374766 One of the largest releases for KVM... Hardly any generic improvement,
but lots of architecture-specific changes.
 
 * ARM:
 - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
 - PMU support for guests
 - 32bit world switch rewritten in C
 - various optimizations to the vgic save/restore code.
 
 * PPC:
 - enabled KVM-VFIO integration ("VFIO device")
 - optimizations to speed up IPIs between vcpus
 - in-kernel handling of IOMMU hypercalls
 - support for dynamic DMA windows (DDW).
 
 * s390:
 - provide the floating point registers via sync regs;
 - separated instruction vs. data accesses
 - dirty log improvements for huge guests
 - bugfixes and documentation improvements.
 
 * x86:
 - Hyper-V VMBus hypercall userspace exit
 - alternative implementation of lowest-priority interrupts using vector
 hashing (for better VT-d posted interrupt support)
 - fixed guest debugging with nested virtualizations
 - improved interrupt tracking in the in-kernel IOAPIC
 - generic infrastructure for tracking writes to guest memory---currently
 its only use is to speedup the legacy shadow paging (pre-EPT) case, but
 in the future it will be used for virtual GPUs as well
 - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJW5r3BAAoJEL/70l94x66D2pMH/jTSWWwdTUJMctrDjPVzKzG0
 yOzHW5vSLFoFlwEOY2VpslnXzn5TUVmCAfrdmFNmQcSw6hGb3K/xA/ZX/KLwWhyb
 oZpr123ycahga+3q/ht/dFUBCCyWeIVMdsLSFwpobEBzPL0pMgc9joLgdUC6UpWX
 tmN0LoCAeS7spC4TTiTTpw3gZ/L+aB0B6CXhOMjldb9q/2CsgaGyoVvKA199nk9o
 Ngu7ImDt7l/x1VJX4/6E/17VHuwqAdUrrnbqerB/2oJ5ixsZsHMGzxQ3sHCmvyJx
 WG5L00ubB1oAJAs9fBg58Y/MdiWX99XqFhdEfxq4foZEiQuCyxygVvq3JwZTxII=
 =OUZZ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "One of the largest releases for KVM...  Hardly any generic
  changes, but lots of architecture-specific updates.

  ARM:
   - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
   - PMU support for guests
   - 32bit world switch rewritten in C
   - various optimizations to the vgic save/restore code.

  PPC:
   - enabled KVM-VFIO integration ("VFIO device")
   - optimizations to speed up IPIs between vcpus
   - in-kernel handling of IOMMU hypercalls
   - support for dynamic DMA windows (DDW).

  s390:
   - provide the floating point registers via sync regs;
   - separated instruction vs.  data accesses
   - dirty log improvements for huge guests
   - bugfixes and documentation improvements.

  x86:
   - Hyper-V VMBus hypercall userspace exit
   - alternative implementation of lowest-priority interrupts using
     vector hashing (for better VT-d posted interrupt support)
   - fixed guest debugging with nested virtualizations
   - improved interrupt tracking in the in-kernel IOAPIC
   - generic infrastructure for tracking writes to guest
     memory - currently its only use is to speedup the legacy shadow
     paging (pre-EPT) case, but in the future it will be used for
     virtual GPUs as well
   - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
  KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
  KVM: x86: disable MPX if host did not enable MPX XSAVE features
  arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
  arm64: KVM: vgic-v3: Reset LRs at boot time
  arm64: KVM: vgic-v3: Do not save an LR known to be empty
  arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
  arm64: KVM: vgic-v3: Avoid accessing ICH registers
  KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
  KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
  KVM: arm/arm64: vgic-v2: Reset LRs at boot time
  KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
  KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
  KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
  KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
  KVM: s390: allocate only one DMA page per VM
  KVM: s390: enable STFLE interpretation only if enabled for the guest
  KVM: s390: wake up when the VCPU cpu timer expires
  KVM: s390: step the VCPU timer while in enabled wait
  KVM: s390: protect VCPU cpu timer with a seqcount
  KVM: s390: step VCPU cpu timer during kvm_run ioctl
  ...
2016-03-16 09:55:35 -07:00
Linus Torvalds
d37a14bb5f Merge branch 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull ram resource handling changes from Ingo Molnar:
 "Core kernel resource handling changes to support NVDIMM error
  injection.

  This tree introduces a new I/O resource type, IORESOURCE_SYSTEM_RAM,
  for System RAM while keeping the current IORESOURCE_MEM type bit set
  for all memory-mapped ranges (including System RAM) for backward
  compatibility.

  With this resource flag it no longer takes a strcmp() loop through the
  resource tree to find "System RAM" resources.

  The new resource type is then used to extend ACPI/APEI error injection
  facility to also support NVDIMM"

* 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ACPI/EINJ: Allow memory error injection to NVDIMM
  resource: Kill walk_iomem_res()
  x86/kexec: Remove walk_iomem_res() call with GART type
  x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search
  resource: Add walk_iomem_res_desc()
  memremap: Change region_intersects() to take @flags and @desc
  arm/samsung: Change s3c_pm_run_res() to use System RAM type
  resource: Change walk_system_ram() to use System RAM type
  drivers: Initialize resource entry to zero
  xen, mm: Set IORESOURCE_SYSTEM_RAM to System RAM
  kexec: Set IORESOURCE_SYSTEM_RAM for System RAM
  arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM
  ia64: Set System RAM type and descriptor
  x86/e820: Set System RAM type and descriptor
  resource: Add I/O resource descriptor
  resource: Handle resource flags properly
  resource: Add System RAM resource type
2016-03-14 15:15:51 -07:00
Christophe Leroy
060ef9d89d powerpc32: PAGE_EXEC required for inittext
PAGE_EXEC is required for inittext, otherwise CONFIG_DEBUG_PAGEALLOC
ends up with an Oops

[    0.000000] Inode-cache hash table entries: 8192 (order: 1, 32768 bytes)
[    0.000000] Sorting __ex_table...
[    0.000000] bootmem::free_all_bootmem_core nid=0 start=0 end=2000
[    0.000000] Unable to handle kernel paging request for instruction fetch
[    0.000000] Faulting instruction address: 0xc045b970
[    0.000000] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.000000] PREEMPT DEBUG_PAGEALLOC CMPC885
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.25-local-dirty #1673
[    0.000000] task: c04d83d0 ti: c04f8000 task.ti: c04f8000
[    0.000000] NIP: c045b970 LR: c045b970 CTR: 0000000a
[    0.000000] REGS: c04f9ea0 TRAP: 0400   Not tainted  (3.18.25-local-dirty)
[    0.000000] MSR: 08001032 <ME,IR,DR,RI>  CR: 39955d35  XER: a000ff40
[    0.000000]
GPR00: c045b970 c04f9f50 c04d83d0 00000000 ffffffff c04dcdf4 00000048 c04f6b10
GPR08: c04f6ab0 00000001 c0563488 c04f6ab0 c04f8000 00000000 00000000 b6db6db7
GPR16: 00003474 00000180 00002000 c7fec000 00000000 000003ff 00000176 c0415014
GPR24: c0471018 c0414ee8 c05304e8 c03aeaac c0510000 c0471018 c0471010 00000000
[    0.000000] NIP [c045b970] free_all_bootmem+0x164/0x228
[    0.000000] LR [c045b970] free_all_bootmem+0x164/0x228
[    0.000000] Call Trace:
[    0.000000] [c04f9f50] [c045b970] free_all_bootmem+0x164/0x228 (unreliable)
[    0.000000] [c04f9fa0] [c0454044] mem_init+0x3c/0xd0
[    0.000000] [c04f9fb0] [c045080c] start_kernel+0x1f4/0x390
[    0.000000] [c04f9ff0] [c0002214] start_here+0x38/0x98
[    0.000000] Instruction dump:
[    0.000000] 2f150000 7f968840 72a90001 3ad60001 56b5f87e 419a0028 419e0024 41a20018
[    0.000000] 807cc20c 38800000 7c638214 4bffd2f5 <3a940001> 3a100024 4bffffc8 7e368b78
[    0.000000] ---[ end trace dc8fa200cb88537f ]---

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 20:04:32 -06:00
Christophe Leroy
8478d7f091 powerpc: Simplify test in __dma_sync()
This simplification helps the compiler. We now have only one test
instead of two, so it reduces the number of branches.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 17:20:12 -06:00
Christophe Leroy
766d45cbee powerpc/8xx: rewrite flush_instruction_cache() in C
On PPC8xx, flushing instruction cache is performed by writing
in register SPRN_IC_CST. This registers suffers CPU6 ERRATA.
The patch rewrites the fonction in C so that CPU6 ERRATA will
be handled transparently

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 17:20:11 -06:00
Christophe Leroy
a7761fe489 powerpc/8xx: rewrite set_context() in C
There is no real need to have set_context() in assembly.
Now that we have mtspr() handling CPU6 ERRATA directly, we
can rewrite set_context() in C language for easier maintenance.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 17:20:11 -06:00
Christophe Leroy
e974cd4be0 powerpc32: remove ioremap_base
ioremap_base is not initialised and is nowhere used so remove it

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 17:18:02 -06:00
Christophe Leroy
c562eb06d5 powerpc32: Remove useless/wrong MMU:setio progress message
Commit 7711684947 ("[POWERPC] Remove unused machine call outs")
removed the call to setup_io_mappings(), so remove the associated
progress line message

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 17:18:02 -06:00
Christophe Leroy
3084cdb7cd powerpc32: refactor x_mapped_by_bats() and x_mapped_by_tlbcam() together
x_mapped_by_bats() and x_mapped_by_tlbcam() serve the same kind of
purpose, and are never defined at the same time.
So rename them x_block_mapped() and define them in the relevant
places

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 17:18:02 -06:00
Christophe Leroy
516d91893b powerpc/8xx: move setup_initial_memory_limit() into 8xx_mmu.c
Now we have a 8xx specific .c file for that so put it in there
as other powerpc variants do

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 17:18:01 -06:00
Christophe Leroy
a372acfac5 powerpc/8xx: Map linear kernel RAM with 8M pages
On a live running system (VoIP gateway for Air Trafic Control), over
a 10 minutes period (with 277s idle), we get 87 millions DTLB misses
and approximatly 35 secondes are spent in DTLB handler.
This represents 5.8% of the overall time and even 10.8% of the
non-idle time.
Among those 87 millions DTLB misses, 15% are on user addresses and
85% are on kernel addresses. And within the kernel addresses, 93%
are on addresses from the linear address space and only 7% are on
addresses from the virtual address space.

MPC8xx has no BATs but it has 8Mb page size. This patch implements
mapping of kernel RAM using 8Mb pages, on the same model as what is
done on the 40x.

In 4k pages mode, each PGD entry maps a 4Mb area: we map every two
entries to the same 8Mb physical page. In each second entry, we add
4Mb to the page physical address to ease life of the FixupDAR
routine. This is just ignored by HW.

In 16k pages mode, each PGD entry maps a 64Mb area: each PGD entry
will point to the first page of the area. The DTLB handler adds
the 3 bits from EPN to map the correct page.

With this patch applied, we now get only 13 millions TLB misses
during the 10 minutes period. The idle time has increased to 313s
and the overall time spent in DTLB miss handler is 6.3s, which
represents 1% of the overall time and 2.2% of non-idle time.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-11 17:18:01 -06:00
Paolo Bonzini
ab92f30875 KVM/ARM updates for 4.6
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
 - PMU support for guests
 - 32bit world switch rewritten in C
 - Various optimizations to the vgic save/restore code
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW36xjAAoJECPQ0LrRPXpDGQkQAMDppzcTOixT3e8VPdHAX09a
 Z5PO0gyTMVV7Jyz5Ul3pedPJA2GSK9mxOCwqvIFbdxLAR6ZB00juO5FrTHkSdI91
 1XLPj4bKoMWcVvhL/g5A4Glp/pVMW1k/9Yq8zZAtYlsLRlqG5rLOutSadcqHcYaJ
 cTD/pFf7b2oPtkTPyoFml75KgHBT/8uvAvFDOWA66Id2z6T11+PsBT/6XnGDiwKg
 tpGTNzx3kPIKIzOAOHqVW6UBxFOeabebXLT8wUz3VwNn/UbG6gkumMNApMAyF2q1
 zU0nAh8+7Ek6Dr4OFWE6BfW6sgg/l7i1lA8XoAmqG7ZTrSptCc59fvaZJxPruG+Q
 dMsU6QgR77JJjbZTinf9a1jReZ/liZrx2gZXedVKdILrjmDSq0UnGcxjUOEDZOGy
 2/dbrlJhv+LhpcJtuPpxPCfoqbW5L0ynzmuYuXRdRz3lTHiOWIRx5gugrhO+wH4D
 4gvZhbw3XCiYfpYHYhl8A1EH5kanKgdXDocz9yIm7mZm89gngufF/HkeXS3ZU25T
 yThyBGulGjqN4FCdgf1HolkTfFjnfSx4qJovJ58eHga+HNLXRkTecZZcbFy2OOHv
 8Bx0PIlwj4RgSaRLWQUudAhdhKS2g22DKDDljxFwhkMPNghvqkYMJCRDKLu6GBXQ
 4YsLKM+TaShHFjSpx+ao
 =rpvb
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM updates for 4.6

- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- Various optimizations to the vgic save/restore code

Conflicts:
	include/uapi/linux/kvm.h
2016-03-09 11:50:42 +01:00
Linus Torvalds
b8155fe1b2 powerpc fixes for 4.5 #4
- cxl: Fix PSL timebase synchronization detection from Frederic Barrat
  - Fix oops when destroying hw_breakpoint event from Ravi Bangoria
  - Avoid lbarx on e5500 from Scott Wood
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIbBAABAgAGBQJW21wIAAoJEFHr6jzI4aWAiTQP+PWe24xwWmC/9FWBhT5rmHbs
 CO82Wy2X3gjMsQoXGqe48ohfToJAWEt5yXS+H/PKpoTYJ/yFcih2ufRw55Y6xzHs
 M98c4Rt4mAbGwkmwWTBjSc1zP+ReWeMNwRE69j6N6Xl+fDqGg6tMKrcysh6wdDLB
 ZhLeHZbVEGhIlXuxMjGyPrMxSrvcEZMMVrxp8eTycC6+wBHB4w4WzOpFx87WM0fn
 3g1Q5WWzRjW2iCeWa6tJLwIIzEACrFs8nfJU08mJU1n33YYVbpu+2klG+5ppfGAL
 JiuKt9/ZoWSAMwNU7yEwFdgjtnOzywTnt6kAUWFdMKLV7DktJ+sf0ucZF9oBg87e
 ermsXt+Jy591NXgoif5gFMELWouC+Ti9pnqxdOTawYC8TKFm0OGExJFZQMGmzBw1
 9ZpO/2pl0bsuNZyClspuFyVRrNb1sxddisSyKlxW+0Vnz1P432fRZZSz2gHLvClG
 RjQ+nsKZYj6vH0wfA+eG9+/a5iYjbfBKRNSjrO1g8pbc3ELeG5C2kXDePKnkt8+J
 jpvTLHxaRp44wNPcizPrEthrdlKxE8uvTUosRuSMSJu74SK7t8+9kCvgDnHYne51
 f/2Ko7CzAV2SfdKB+nWI2oJxfYpPlxqORZkYGkQZIOD5uQ8pvJ4yCNzy43XqFHFE
 DyJOYeTxVQmfMT6jkrg=
 =ZcDj
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 - cxl: Fix PSL timebase synchronization detection from Frederic Barrat
 - Fix oops when destroying hw_breakpoint event from Ravi Bangoria
 - Avoid lbarx on e5500 from Scott Wood

* tag 'powerpc-4.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/fsl-book3e: Avoid lbarx on e5500
  powerpc/hw_breakpoint: Fix oops when destroying hw_breakpoint event
  cxl: Fix PSL timebase synchronization detection
2016-03-06 11:08:06 -08:00
chenhui zhao
ebb9d30a6a powerpc/mm: any thread in one core can be the first to setup TLB1
On e6500, in the case of cpu hotplug, either thread in one core
may be the first thread initilzing the TLB1. The subsequent threads
must not setup it again.

The code is derived from the comment of Scott Wood.

Signed-off-by: Chenhui Zhao <chenhui.zhao@freescale.com>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-04 23:44:02 -06:00
Ingo Molnar
bc94b99636 Linux 4.5-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW0yM6AAoJEHm+PkMAQRiGeUwIAJRTHFPJTFpJcJjeZEV4/EL1
 7Pl0WSHs/CWBkXIevAg2HgkECSQ9NI9FAUFvoGxCldDpFAnL1U2QV8+Ur2qhiXMG
 5v0jILJuiw57qT/NfhEudZolerlRoHILmB3JRTb+DUV4GHZuWpTkJfUSI9j5aTEl
 w83XUgtK4bKeIyFbHdWQk6xqfzfFBSuEITuSXreOMwkFfMmeScE0WXOPLBZWyhPa
 v0rARJLYgM+vmRAnJjnG8unH+SgnqiNcn2oOFpevKwmpVcOjcEmeuxh/HdeZf7HM
 /R8F86OwdmXsO+z8dQxfcucLg+I9YmKfFr8b6hopu1sRztss2+Uk6H1j2J7IFIg=
 =tvkh
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc6' into core/resources, to resolve conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-04 12:12:08 +01:00
Scott Wood
37c5e942bb powerpc/fsl-book3e: Avoid lbarx on e5500
lbarx/stbcx. are implemented on e6500, but not on e5500.
Likewise, SMT is on e6500, but not on e5500.

So, avoid executing an unimplemented instruction by only locking
when needed (i.e. in the presence of SMT).

Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-03 23:43:05 -06:00
Aneesh Kumar K.V
c367a44133 powerpc/mm: add _PAGE_HASHPTE similar to 4K hash
We don't need to update linux page table entry with _PAGE_HASHPTE early
in hash pte fault. A parallel pte update will loop via _PAGE_BUSY
and look at _PAGE_HASHPTE for a required hpte flush only if
_PAGE_BUSY is cleared. That ensures a pte update will wait for a
parallel hpte insert to finish before looking at _PAGE_HASHPTE bit.

To avoid further confusion drop setting _PAGE_HASHPTE in cmpxchg in __hash_page_4K.

commit 41743a4e34 ("powerpc: Free a PTE bit on ppc64 with 64K pages")
did similar change for 64K config

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-03 21:18:29 +11:00
Aneesh Kumar K.V
e9a681478c powerp/mm: Update code comments
We are updating pte in those functions.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-03 21:18:29 +11:00
Kirill A. Shutemov
ff20c2e0ac mm: Some arch may want to use HPAGE_PMD related values as variables
With next generation power processor, we are having a new mmu model
[1] that require us to maintain a different linux page table format.

Inorder to support both current and future ppc64 systems with a single
kernel we need to make sure kernel can select between different page
table format at runtime. With the new MMU (radix MMU) added, we will
have two different pmd hugepage size 16MB for hash model and 2MB for
Radix model. Hence make HPAGE_PMD related values as a variable.

Actual conversion of HPAGE_PMD to a variable for ppc64 happens in a
followup patch.

[1] http://ibm.biz/power-isa3 (Needs registration).

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-03 21:18:29 +11:00
Aneesh Kumar K.V
368ced78e6 powerpc/mm: Switch book3s 64 with 64K page size to 4 level page table
This is needed so that we can support both hash and radix page table
using single kernel. Radix kernel uses a 4 level table.

We now use physical address in upper page table tree levels. Even though
they are aligned to their size, for the masked bits we use the
bit positions as per PowerISA 3.0.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-03 21:18:28 +11:00
David Gibson
5c3c7ede2b powerpc/mm: Split hash page table sizing heuristic into a helper
htab_get_table_size() either retrieve the size of the hash page table (HPT)
from the device tree - if the HPT size is determined by firmware - or
uses a heuristic to determine a good size based on RAM size if the kernel
is responsible for allocating the HPT.

To support a PAPR extension allowing resizing of the HPT, we're going to
want the memory size -> HPT size logic elsewhere, so split it out into a
helper function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-02 09:06:16 +11:00
David Gibson
1dace6c665 powerpc/mm: Clean up memory hotplug failure paths
This makes a number of cleanups to handling of mapping failures during
memory hotplug on Power:

For errors creating the linear mapping for the hot-added region:
  * This is now reported with EFAULT which is more appropriate than the
    previous EINVAL (the failure is unlikely to be related to the
    function's parameters)
  * An error in this path now prints a warning message, rather than just
    silently failing to add the extra memory.
  * Previously a failure here could result in the region being partially
    mapped.  We now clean up any partial mapping before failing.

For errors creating the vmemmap for the hot-added region:
   * This is now reported with EFAULT instead of causing a BUG() - this
     could happen for external reason (e.g. full hash table) so it's better
     to handle this non-fatally
   * An error message is also printed, so the failure won't be silent
   * As above a failure could cause a partially mapped region, we now
     clean this up. [mpe: move htab_remove_mapping() out of #ifdef
     CONFIG_MEMORY_HOTPLUG to enable this]

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-01 22:04:18 +11:00
David Gibson
27828f98a0 powerpc/mm: Handle removing maybe-present bolted HPTEs
At the moment the hpte_removebolted callback in ppc_md returns void and
will BUG_ON() if the hpte it's asked to remove doesn't exist in the first
place.  This is awkward for the case of cleaning up a mapping which was
partially made before failing.

So, we add a return value to hpte_removebolted, and have it return ENOENT
in the case that the HPTE to remove didn't exist in the first place.

In the (sole) caller, we propagate errors in hpte_removebolted to its
caller to handle.  However, we handle ENOENT specially, continuing to
complete the unmapping over the specified range before returning the error
to the caller.

This means that htab_remove_mapping() will work sanely on a partially
present mapping, removing any HPTEs which are present, while also returning
ENOENT to its caller in case it's important there.

There are two callers of htab_remove_mapping():
   - In remove_section_mapping() we already WARN_ON() any error return,
     which is reasonable - in this case the mapping should be fully
     present
   - In vmemmap_remove_mapping() we BUG_ON() any error.  We change that to
     just a WARN_ON() in the case of ENOENT, since failing to remove a
     mapping that wasn't there in the first place probably shouldn't be
     fatal.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-01 22:04:18 +11:00
David Gibson
abd0a0e791 powerpc/mm: Clean up error handling for htab_remove_mapping
Currently, the only error that htab_remove_mapping() can report is -EINVAL,
if removal of bolted HPTEs isn't implemeted for this platform.  We make
a few clean ups to the handling of this:

 * EINVAL isn't really the right code - there's nothing wrong with the
   function's arguments - use ENODEV instead
 * We were also printing a warning message, but that's a decision better
   left up to the callers, so remove it
 * One caller is vmemmap_remove_mapping(), which will just BUG_ON() on
   error, making the warning message redundant, so no change is needed
   there.
 * The other caller is remove_section_mapping().  This is called in the
   memory hot remove path at a point after vmemmap_remove_mapping() so
   if hpte_removebolted isn't implemented, we'd expect to have already
   BUG()ed anyway.  Put a WARN_ON() here, in lieu of a printk() since this
   really shouldn't be happening.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-01 22:04:17 +11:00
Adam Buchbinder
446957ba51 powerpc: Fix misspellings in comments.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-01 19:27:20 +11:00
Paul Mackerras
a9d4996df1 powerpc/mm/book3s-64: Move HPTE-related bits in PTE to upper end
This moves the _PAGE_HASHPTE, _PAGE_F_GIX and _PAGE_F_SECOND fields in
the Linux PTE on 64-bit Book 3S systems to the most significant byte.
Of the 5 bits, one is a software-use bit and the other four are
reserved bit positions in the PowerISA v3.0 radix PTE format.
Using these bits is OK because these bits are all to do with tracking
the HPTE(s) associated with the Linux PTE, and therefore won't be
needed in radix mode.  This frees up bit positions in the lower two
bytes.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-29 20:34:39 +11:00
Paul Mackerras
849f86a630 powerpc/mm/book3s-64: Move _PAGE_PRESENT to the most significant bit
This changes _PAGE_PRESENT for 64-bit Book 3S processors from 0x2 to
0x8000_0000_0000_0000, because that is where PowerISA v3.0 CPUs in
radix mode will expect to find it.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-29 20:34:39 +11:00
Paul Mackerras
c61a884312 powerpc/mm/book3s-64: Use physical addresses in upper page table tree levels
This changes the Linux page tables to store physical addresses
rather than kernel virtual addresses in the upper levels of the
tree (pgd, pud and pmd) for 64-bit Book 3S machines.

This also changes the hugepd pointers used to implement hugepages
when the base page size is 4k to store physical addresses rather than
virtual addresses (again just for 64-bit Book3S machines).

This frees up some high order bits, and will be needed with
PowerISA v3.0 machines which read the page table tree in hardware
in radix mode.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-29 20:34:34 +11:00
Daniel Cashman
5ef11c35ce mm: ASLR: use get_random_long()
Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long().  Also address shifting bug which, in
case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.

Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Paul Mackerras
f1a9ae034a powerpc/mm/book3s-64: Free up 7 high-order bits in the Linux PTE
This frees up bits 57-63 in the Linux PTE on 64-bit Book 3S machines.
In the 4k page case, this is done just by reducing the size of the
RPN field to 39 bits, giving 51-bit real addresses.  In the 64k page
case, we had 10 unused bits in the middle of the PTE, so this moves
the RPN field down 10 bits to make use of those unused bits.  This
means the RPN field is now 3 bits larger at 37 bits, giving 53-bit
real addresses in the normal case, or 49-bit real addresses for the
special 4k PFN case.

We are doing this in order to be able to move some other PTE bits
into the positions where PowerISA V3.0 processors will expect to
find them in radix-tree mode.  Ultimately we will be able to move
the RPN field to lower bit positions and make it larger.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-27 21:06:57 +11:00
Paul Mackerras
1ec3f93710 powerpc/mm/book3s-64: Clean up some obsolete or misleading comments
No code changes.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-27 21:06:57 +11:00
Aneesh Kumar K.V
9ab3ac233a powerpc/mm/hash: Clear the invalid slot information correctly
We can get a hash pte fault with 4k base page size and find the pte
already inserted with 64K base page size. In that case we need to clear
the existing slot information from the old pte. Fix this correctly

With THP, we also clear the slot information with respect to all
the 64K hash pte mapping that 16MB page. They are all invalid
now. This make sure we don't find the slot valid when we fault with
4k base page size. Finding the slot valid should not result in any wrong
behavior because we do check again in hash page table for the validity.
But we can avoid that check completely.

Fixes: a43c0eb836 ("powerpc/mm: Convert 4k hash insert to C")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-22 19:27:39 +11:00
Alexey Kardashevskiy
e9ab1a1caf powerpc: Make vmalloc_to_phys() public
This makes vmalloc_to_phys() public as there will be another user
(KVM in-kernel VFIO acceleration) for it soon. As this new user
can be compiled as a module, this exports the symbol.

As a little optimization, this changes the helper to call
vmalloc_to_pfn() instead of vmalloc_to_page() as the size of the
struct page may not be power-of-two aligned which will make gcc use
multiply instructions instead of shifts.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16 13:44:26 +11:00
Aneesh Kumar K.V
c777e2a8b6 powerpc/mm: Fix Multi hit ERAT cause by recent THP update
With ppc64 we use the deposited pgtable_t to store the hash pte slot
information. We should not withdraw the deposited pgtable_t without
marking the pmd none. This ensure that low level hash fault handling
will skip this huge pte and we will handle them at upper levels.

Recent change to pmd splitting changed the above in order to handle the
race between pmd split and exit_mmap. The race is explained below.

Consider following race:

		CPU0				CPU1
shrink_page_list()
  add_to_swap()
    split_huge_page_to_list()
      __split_huge_pmd_locked()
        pmdp_huge_clear_flush_notify()
	// pmd_none() == true
					exit_mmap()
					  unmap_vmas()
					    zap_pmd_range()
					      // no action on pmd since pmd_none() == true
	pmd_populate()

As result the THP will not be freed. The leak is detected by check_mm():

	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512

The above required us to not mark pmd none during a pmd split.

The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
level fault handling code skip this pte. At higher level we do take ptl
lock. That should serialze us against the pmd split. Once the lock is
acquired we do check the pmd again using pmd_same. That should always
return false for us and hence we should retry the access. We do the
pmd_same check in all case after taking plt with
THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
huge_pmd_set_accessed)

Also make sure we wait for irq disable section in other cpus to finish
before flipping a huge pte entry with a regular pmd entry. Code paths
like find_linux_pte_or_hugepte depend on irq disable to get
a stable pte_t pointer. A parallel thp split need to make sure we
don't convert a pmd pte to a regular pmd entry without waiting for the
irq disable section to finish.

Fixes: eef1b3ba05 ("thp: implement split_huge_pmd()")
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-15 21:10:04 +11:00
Toshi Kani
35d98e93fe arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM
Set IORESOURCE_SYSTEM_RAM in flags of resource ranges with
"System RAM", "Kernel code", "Kernel data", and "Kernel bss".

Note that:

 - IORESOURCE_SYSRAM (i.e. modifier bit) is set in flags when
   IORESOURCE_MEM is already set. IORESOURCE_SYSTEM_RAM is defined
   as (IORESOURCE_MEM|IORESOURCE_SYSRAM).

 - Some archs do not set 'flags' for children nodes, such as
   "Kernel code".  This patch does not change 'flags' in this
   case.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@linux-mips.org
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/1453841853-11383-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:57 +01:00
Vasant Hegde
e256caa7d0 powerpc/mm: Allow user space to map rtas_rmo_buf
With commit 90a545e9 (restrict /dev/mem to idle io memory ranges) mapping
rtas_rmo_buf from user space is failing. Hence we are not able to make
RTAS syscall.

This patch calls page_is_rtas_user_buf before calling iomem_is_exclusive
in  devmem_is_allowed(). This will allow user space to map rtas_rmo_buf
and we are able to make RTAS syscall.

Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
CC: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-01-25 16:31:13 +11:00
Kirill A. Shutemov
7aa9a23c69 powerpc, thp: remove infrastructure for handling splitting PMDs
With new refcounting we don't need to mark PMDs splitting.  Let's drop
code to handle this.

pmdp_splitting_flush() is not needed too: on splitting PMD we will do
pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
ddc58f27f9 mm: drop tail page refcounting
Tail page refcounting is utterly complicated and painful to support.

It uses ->_mapcount on tail pages to store how many times this page is
pinned.  get_page() bumps ->_mapcount on tail page in addition to
->_count on head.  This information is required by split_huge_page() to
be able to distribute pins from head of compound page to tails during
the split.

We will need ->_mapcount to account PTE mappings of subpages of the
compound page.  We eliminate need in current meaning of ->_mapcount in
tail pages by forbidding split entirely if the page is pinned.

The only user of tail page refcounting is THP which is marked BROKEN for
now.

Let's drop all this mess.  It makes get_page() and put_page() much
simpler.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
78ddc53473 thp: rename split_huge_page_pmd() to split_huge_pmd()
We are going to decouple splitting THP PMD from splitting underlying
compound page.

This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
to reflect the fact that it doesn't imply page splitting, only PMD.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Michael Ellerman
be6bfc29bc Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next
Freescale updates from Scott:

"Highlights include moving QE code out of arch/powerpc (to be shared with
arm), device tree updates, and minor fixes."
2016-01-14 09:55:01 +11:00
Michael Ellerman
c33e54fafa powerpc: Fix build break due to paca mm_context_t changes
Commit 2fc251a8dd ("powerpc: Copy only required pieces of the
mm_context_t to the paca") broke the build for CONFIG_PPC_STD_MMU_64=y
and CONFIG_PPC_MM_SLICES=n.

That only happens for a kernel built with 4K pages and HUGETLB disabled,
which is why we missed it.

Fix it by adding a mm_ctx_user_psize member to the paca and populating
it in the appropriate places.

Fixes: 2fc251a8dd ("powerpc: Copy only required pieces of the mm_context_t to the paca")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-01-09 08:28:44 +11:00
Michael Neuling
2fc251a8dd powerpc: Copy only required pieces of the mm_context_t to the paca
Currently we copy the whole mm_context_t to the paca but only access a
few bits of it.  This is wasteful of space paca and also takes quite
some time in the hot path of context switching.

This patch pulls in only the required bits from the mm_context_t to
the paca and on context switch, copies only those.

Benchmarking this (On top of Anton's recent MSR context switching
changes [1]) using processes and yield shows an improvement of almost
3% on POWER8:

  http://ozlabs.org/~anton/junkcode/context_switch2.c
  ./context_switch2 --test=yield --process 0 0

1. https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-October/135700.html

Signed-off-by: Michael Neuling <mikey@neuling.org>
[mpe: Rename paca fields to be mm_ctx_foo rather than context_foo]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-27 19:12:14 +11:00
Scott Wood
2749539b4b powerpc/e6500: add locking to hugetlb
e6500 has threads but does not have TLB write conditional.  Thus,
the hugetlb code needs to take the same lock that the normal TLB miss
handlers take, to ensure that the tlbsx and tlbwe are atomic.

Signed-off-by: Scott Wood <scottwood@freescale.com>
2015-12-22 18:23:22 -06:00
Michael Neuling
c395465da6 powerpc: Add function to copy mm_context_t to the paca
This adds a function to copy the mm->context to the paca.  This is
only a basic conversion for now but will be used more extensively in
the next patch.

This also adds #ifdef CONFIG_PPC_BOOK3S around this code since it's
not used elsewhere.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-19 22:13:12 +11:00
Aneesh Kumar K.V
4dcbd88eb6 powerpc/mm: Don't open code pgtable_t size
The slot information of base page size hash pte is stored in the
pgtable_t w.r.t transparent hugepage. We need to make sure we don't
index beyond pgtable_t size.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:17 +11:00
Aneesh Kumar K.V
62607bc64c powerpc/mm: Don't hardcode page table size
pte and pmd table size are dependent on config items. Don't
hard code the same. This make sure we use the right value
when masking pmd entries and also while checking pmd_bad

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:15 +11:00
Aneesh Kumar K.V
6a119eae94 powerpc/mm: Add a _PAGE_PTE bit
For a pte entry we will have _PAGE_PTE set. Our pte page
address have a minimum alignment requirement of HUGEPD_SHIFT_MASK + 1.
We use the lower 7 bits to indicate hugepd. ie.

For pmd and pgd we can find:
1) _PAGE_PTE set pte -> indicate PTE
2) bits [2..6] non zero -> indicate hugepd.
   They also encode the size. We skip bit 1 (_PAGE_PRESENT).
3) othewise pointer to next table.

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:14 +11:00
Aneesh Kumar K.V
e34aa03ca4 powerpc/mm: Move THP headers around
We support THP only with book3s_64 and 64K page size. Move
THP details to hash64-64k.h to clarify the same.

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:14 +11:00
Aneesh Kumar K.V
26a344aea4 powerpc/mm: Move hugetlb related headers
W.r.t hugetlb, we support two format for pmd. With book3s_64 and
64K linux page size, we can have pte at the pmd level. Hence we
don't need to support hugepd there. For everything else hugepd
is supported and pmd_huge is (0).

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:13 +11:00
Aneesh Kumar K.V
40e8550afc powerpc/mm: Move WIMG update to helper.
Only difference here is, we apply the WIMG mapping early, so rflags
passed to updatepp will also be changed.

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:13 +11:00
Aneesh Kumar K.V
c6a3c495f0 powerpc/mm: Add helper for converting pte bit to hpte bits
Instead of open coding it in multiple code paths, export the helper
and add more documentation. Also make sure we don't make assumption
regarding pte bit position

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:12 +11:00
Aneesh Kumar K.V
a43c0eb836 powerpc/mm: Convert 4k insert from asm to C
This is similar to 64K insert. May be we want to consolidate

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:12 +11:00
Aneesh Kumar K.V
89ff725051 powerpc/mm: Convert __hash_page_64K to C
Convert from asm to C

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:11 +11:00
Aneesh Kumar K.V
506b863c68 powerpc/mm: Remove pte_val usage for the second half of pgtable_t
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:10 +11:00
Aneesh Kumar K.V
bf680d5160 powerpc/mm: Don't track subpage valid bit in pte_t
This free up 11 bits in pte_t. In the later patch we also change
the pte_t format so that we can start supporting migration pte
at pmd level. We now track 4k subpage valid bit as below

If we have _PAGE_COMBO set, we override the _PAGE_F_GIX_SHIFT
and _PAGE_F_SECOND. Together we have 4 bits, each of them
used to indicate whether any of the 4 4k subpage in that group
is valid. ie,

[ group 1 bit ]   [ group 2 bit ]  ..... [ group 4 ]
[ subpage 1 - 4]  [ subpage 5- 8]  ..... [ subpage 13 - 16]

We still track each 4k subpage slot number and secondary hash
information in the second half of pgtable_t. Removing the subpage
tracking have some significant overhead on aim9 and ebizzy benchmark and
to support THP with 4K subpage, we do need a pgtable_t of 4096 bytes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:10 +11:00
Aneesh Kumar K.V
106713a145 powerpc/mm: Remove the dependency on pte bit position in asm code
We should not expect pte bit position in asm code. Simply
by moving part of that to C

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:10 +11:00
Aneesh Kumar K.V
91f1da9979 powerpc/mm: Convert 4k hash insert to C
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:09 +11:00
Aneesh Kumar K.V
f281b5d50c powerpc/mm: Don't use pmd_val, pud_val and pgd_val as lvalue
We convert them static inline function here as we did with pte_val in
the previous patch

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:07 +11:00
Aneesh Kumar K.V
0863d7f213 powerpc/mm: Fix infinite loop in hash fault with 4K page size
This is the same bug we fixed as part of 09567e7fd4
("powerpc/mm: Check paca psize is up to date for huge mappings"). Please
check that for details. The difference here is that faults were
happening on a 4K page at an address previously mapped by hugetlb.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14 15:19:04 +11:00
Linus Torvalds
2f4bf528ec powerpc updates for 4.4
- Kconfig: remove BE-only platforms from LE kernel build from Boqun Feng
  - Refresh ps3_defconfig from Geoff Levand
  - Emit GNU & SysV hashes for the vdso from Michael Ellerman
  - Define an enum for the bolted SLB indexes from Anshuman Khandual
  - Use a local to avoid multiple calls to get_slb_shadow() from Michael Ellerman
  - Add gettimeofday() benchmark from Michael Neuling
  - Avoid link stack corruption in __get_datapage() from Michael Neuling
  - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar K.V
  - Add ppc64le_defconfig from Michael Ellerman
  - pseries: extract of_helpers module from Andy Shevchenko
  - Correct string length in pseries_of_derive_parent() from Nathan Fontenot
  - Free the MSI bitmap if it was slab allocated from Denis Kirjanov
  - Shorten irq_chip name for the SIU from Christophe Leroy
  - Wait 1s for secondaries to enter OPAL during kexec from Samuel Mendoza-Jonas
  - Fix _ALIGN_* errors due to type difference. from Aneesh Kumar K.V
  - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from Colin Ian King
  - Disable hugepd for 64K page size. from Aneesh Kumar K.V
  - Differentiate between hugetlb and THP during page walk from Aneesh Kumar K.V
  - Make PCI non-optional for pseries from Michael Ellerman
  - Individual System V IPC system calls from Sam bobroff
  - Add selftest of unmuxed IPC calls from Michael Ellerman
  - discard .exit.data at runtime from Stephen Rothwell
  - Delete old orphaned PrPMC 280/2800 DTS and boot file. from Paul Gortmaker
  - Use of_get_next_parent to simplify code from Christophe Jaillet
  - Paginate some xmon output from Sam bobroff
  - Add some more elements to the xmon PACA dump from Michael Ellerman
  - Allow the tm-syscall selftest to build with old headers from Michael Ellerman
  - Run EBB selftests only on POWER8 from Denis Kirjanov
  - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael Ellerman
  - Avoid reference to potentially freed memory in prom.c from Christophe Jaillet
  - Quieten boot wrapper output with run_cmd from Geoff Levand
  - EEH fixes and cleanups from Gavin Shan
  - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
  - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael Ellerman
  - Fix section mismatch warning in msi_bitmap_alloc() from Denis Kirjanov
  - Fix ps3-lpm white space from Rudhresh Kumar J
  - Fix ps3-vuart null dereference from Colin King
  - nvram: Add missing kfree in error path from Christophe Jaillet
  - nvram: Fix function name in some errors messages. from Christophe Jaillet
  - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro Koskinen
  - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
  - cxl: Free virtual PHB when removing from Andrew Donnellan
  - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from Michael Ellerman
  - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building with O= from Michael Ellerman
 
  - Freescale updates from Scott: Highlights include 64-bit book3e kexec/kdump
    support, a rework of the qoriq clock driver, device tree changes including
    qoriq fman nodes, support for a new 85xx board, and some fixes.
 
  - MPC5xxx updates from Anatolij: Highlights include a driver for MPC512x
    LocalPlus Bus FIFO with its device tree binding documentation, mpc512x
    device tree updates and some minor fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWPEZgAAoJEFHr6jzI4aWANjYQAKX2Q/95hqKfCuF5FBcUmtMC
 Pu/Nff027MVzxZ2ApDcvvLGps5Nz2bn3nIhc9zjkXc5E8DuL6X3Yl8ce7qyNcc3g
 cJJ8RvtUo6J1OMWetXFehtPYniAAwKMhZYKnj0+WnLr2SyH/Vhl3ehDkFbGyPtuH
 r+2E7krFjfVgU+bzciIFnOaDekFuFN/pXWMb6e6zQyBJe9N8ZIp96uouGCebKVd0
 VDLItzdaKErT8JFfbymMPvZm3V0rMVx4WWu3kAbQX8LrD5a18NF1zrjAOHRXc61n
 kkk8/DPuNOon1PbXXyiS5BcFyZRe+KE3VBnoW5sOMqMIRg5WdO1oU3e2pEfXMO8+
 leXYwFLXiKzUZuOgQG2QiUhrzD2yC1o6/TJWATv0dSl9AwrecgPX+Vj6X357slAf
 A9E3eMy5tgnpndBWZmvZS3W7YDKH+NkeZ+Q40+NErAlqr++ErrTcKVndk5vWlYTT
 7mMZeTXagX66al/k5ATKqwB7iUSpnYHSAa9fcUYPSM2FnXsDxPyeJGkBbcoOmkGj
 QrpgNYOvJaUJd076goZCV39v0c1xpfV9/9kyVch8HUadf6JcjpVZwYnbGw2qlJjh
 ZanuBG2VOeSwaKQqXiRBSBetnpAg8CVpFjDmX9wOBfSek2wxEJqDX/vQExdbIDQQ
 pUs7vnUxLzhmW/x+ygOI
 =YwcM
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Kconfig: remove BE-only platforms from LE kernel build from Boqun
   Feng
 - Refresh ps3_defconfig from Geoff Levand
 - Emit GNU & SysV hashes for the vdso from Michael Ellerman
 - Define an enum for the bolted SLB indexes from Anshuman Khandual
 - Use a local to avoid multiple calls to get_slb_shadow() from Michael
   Ellerman
 - Add gettimeofday() benchmark from Michael Neuling
 - Avoid link stack corruption in __get_datapage() from Michael Neuling
 - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
   K.V
 - Add ppc64le_defconfig from Michael Ellerman
 - pseries: extract of_helpers module from Andy Shevchenko
 - Correct string length in pseries_of_derive_parent() from Nathan
   Fontenot
 - Free the MSI bitmap if it was slab allocated from Denis Kirjanov
 - Shorten irq_chip name for the SIU from Christophe Leroy
 - Wait 1s for secondaries to enter OPAL during kexec from Samuel
   Mendoza-Jonas
 - Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
 - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
   Colin Ian King
 - Disable hugepd for 64K page size, from Aneesh Kumar K.V
 - Differentiate between hugetlb and THP during page walk from Aneesh
   Kumar K.V
 - Make PCI non-optional for pseries from Michael Ellerman
 - Individual System V IPC system calls from Sam bobroff
 - Add selftest of unmuxed IPC calls from Michael Ellerman
 - discard .exit.data at runtime from Stephen Rothwell
 - Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
   Gortmaker
 - Use of_get_next_parent to simplify code from Christophe Jaillet
 - Paginate some xmon output from Sam bobroff
 - Add some more elements to the xmon PACA dump from Michael Ellerman
 - Allow the tm-syscall selftest to build with old headers from Michael
   Ellerman
 - Run EBB selftests only on POWER8 from Denis Kirjanov
 - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
   Ellerman
 - Avoid reference to potentially freed memory in prom.c from Christophe
   Jaillet
 - Quieten boot wrapper output with run_cmd from Geoff Levand
 - EEH fixes and cleanups from Gavin Shan
 - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
 - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
   Ellerman
 - Fix section mismatch warning in msi_bitmap_alloc() from Denis
   Kirjanov
 - Fix ps3-lpm white space from Rudhresh Kumar J
 - Fix ps3-vuart null dereference from Colin King
 - nvram: Add missing kfree in error path from Christophe Jaillet
 - nvram: Fix function name in some errors messages, from Christophe
   Jaillet
 - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
   Koskinen
 - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
 - cxl: Free virtual PHB when removing from Andrew Donnellan
 - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
   Michael Ellerman
 - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
   with O= from Michael Ellerman
 - Freescale updates from Scott: Highlights include 64-bit book3e
   kexec/kdump support, a rework of the qoriq clock driver, device tree
   changes including qoriq fman nodes, support for a new 85xx board, and
   some fixes.
 - MPC5xxx updates from Anatolij: Highlights include a driver for
   MPC512x LocalPlus Bus FIFO with its device tree binding
   documentation, mpc512x device tree updates and some minor fixes.

* tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
  powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
  powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
  powerpc/pseries: Correct string length in pseries_of_derive_parent()
  powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
  powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
  powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
  powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
  powerpc: handle error case in cpm_muram_alloc()
  powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
  powerpc/book3e-64: Enable kexec
  powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
  powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
  powerpc/book3e-64/kexec: Enable SMP release
  powerpc/book3e-64/kexec: create an identity TLB mapping
  powerpc/book3e-64: Don't limit paca to 256 MiB
  powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
  powerpc/book3e: support CONFIG_RELOCATABLE
  powerpc/booke64: Fix args to copy_and_flush
  powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
  powerpc/e6500: kexec: Handle hardware threads
  ...
2015-11-05 23:38:43 -08:00
Raghavendra K T
c118baf802 arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes
With the setup_nr_nodes(), we have already initialized
node_possible_map.  So it is safe to use for_each_node here.

There are many places in the kernel that use hardcoded 'for' loop with
nr_node_ids, because all other architectures have numa nodes populated
serially.  That should be reason we had maintained the same for
powerpc.

But, since sparse numa node ids possible on powerpc, we unnecessarily
allocate memory for non existent numa nodes.

For e.g., on a system with 0,1,16,17 as numa nodes nr_node_ids=18 and
we allocate memory for nodes 2-14.  This patch we allocate memory for
only existing numa nodes.

The patch is boot tested on a 4 node tuleta, confirming with printks
that it works as expected.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anton Blanchard <anton@samba.org>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Greg Kurz <gkurz@linux.vnet.ibm.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Kevin Hao
e1f580e8ce powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
In order to workaround Erratum A-008139, we have to invalidate the
tlb entry with tlbilx before overwriting. Due to the performance
consideration, we don't add any memory barrier when acquire/release
the tcd lock. This means the two load instructions for esel_next do
have the possibility to return different value. This is definitely
not acceptable due to the Erratum A-008139. We have two options to
fix this issue:
  a) Add memory barrier when acquire/release tcd lock to order the
     load/store to esel_next.
  b) Just make sure to invalidate and write to the same tlb entry and
     tolerate the race that we may get the wrong value and overwrite
     the tlb entry just updated by the other thread.

We observe better performance using option b. So reserve an additional
register to save the value of the esel_next.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
2015-10-27 18:14:40 -05:00
Scott Wood
eba5de8dc1 powerpc/fsl-booke-64: Don't limit ppc64_rma_size to one TLB entry
This is required for kdump to work when loaded at at an address that
does not fall within the first TLB entry -- which can easily happen
because while the lower limit is enforced via reserved memory, which
doesn't affect how much is mapped, the upper limit is enforced via a
different mechanism that does.  Thus, more TLB entries are needed than
would normally be used, as the total memory to be mapped might not be a
power of two.

Signed-off-by: Scott Wood <scottwood@freescale.com>
2015-10-27 18:13:22 -05:00
Scott Wood
d9e1831a42 powerpc/85xx: Load all early TLB entries at once
Use an AS=1 trampoline TLB entry to allow all normal TLB1 entries to
be loaded at once.  This avoids the need to keep the translation that
code is executing from in the same TLB entry in the final TLB
configuration as during early boot, which in turn is helpful for
relocatable kernels (e.g. kdump) where the kernel is not running from
what would be the first TLB entry.

On e6500, we limit map_mem_in_cams() to the primary hwthread of a
core (the boot cpu is always considered primary, as a kdump kernel
can be entered on any cpu).  Each TLB only needs to be set up once,
and when we do, we don't want another thread to be running when we
create a temporary trampoline TLB1 entry.

Signed-off-by: Scott Wood <scottwood@freescale.com>
2015-10-22 22:50:46 -05:00
Christophe Jaillet
1def37586f powerpc/numa: Use of_get_next_parent to simplify code
of_get_next_parent can be used to simplify the while() loop and
avoid the need of a temp variable.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-15 20:32:00 +11:00
Aneesh Kumar K.V
891121e6c0 powerpc/mm: Differentiate between hugetlb and THP during page walk
We need to properly identify whether a hugepage is an explicit or
a transparent hugepage in follow_huge_addr(). We used to depend
on hugepage shift argument to do that. But in some case that can
result in wrong results. For ex:

On finding a transparent hugepage we set hugepage shift to PMD_SHIFT.
But we can end up clearing the thp pte, via pmdp_huge_get_and_clear.
We do prevent reusing the pfn page via the usage of
kick_all_cpus_sync(). But that happens after we updated the pte to 0.
Hence in follow_huge_addr() we can find hugepage shift set, but transparent
huge page check fail for a thp pte.

NOTE: We fixed a variant of this race against thp split in commit
691e95fd73
("powerpc/mm/thp: Make page table walk safe against thp split/collapse")

Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
follow_page_mask occasionally.

In the long term, we may want to switch ppc64 64k page size config to
enable CONFIG_ARCH_WANT_GENERAL_HUGETLB

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-12 15:30:09 +11:00
Aneesh Kumar K.V
ec2640b114 powerpc/mm: Disable hugepd for 64K page size.
After commit e2b3d202d1
("powerpc: Switch 16GB and 16MB explicit hugepages to a
different page table format"), we don't need to support
is_hugepd() for 64K page size.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-12 15:29:59 +11:00
Cyril Bur
fdf880a608 powerpc: Fix checkstop in native_hpte_clear() with lockdep
native_hpte_clear() is called in real mode from two places:
- Early in boot during htab initialisation if firmware assisted dump is
  active.
- Late in the kexec path.

In both contexts there is no need to disable interrupts are they are
already disabled. Furthermore, locking around the tlbie() is only required
for pre POWER5 hardware.

On POWER5 or newer hardware concurrent tlbie()s work as expected and on pre
POWER5 hardware concurrent tlbie()s could result in deadlock. This code
would only be executed at crashdump time, during which all bets are off,
concurrent tlbie()s are unlikely and taking locks is unsafe therefore the
best course of action is to simply do nothing. Concurrent tlbie()s are not
possible in the first case as secondary CPUs have not come up yet.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-09 08:01:38 +11:00
Michael Ellerman
26cd835ef8 powerpc/slb: Use a local to avoid multiple calls to get_slb_shadow()
For no reason other than it looks ugly.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-01 16:52:01 +10:00
Anshuman Khandual
1d15010c34 powerpc/slb: Define an enum for the bolted indexes
This patch defines macros for the three bolted SLB indexes we use.
Switch the functions that take the indexes as an argument to use the
enum.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-01 16:52:01 +10:00
Linus Torvalds
f240bdd2a5 powerpc fixes for 4.3
- Fix 32-bit TCE table init in kdump kernel from Nish
  - Fix kdump with non-power-of-2 crashkernel= from Nish
  - Abort cxl_pci_enable_device_hook() if PCI channel is offline from Andrew
  - Fix to release DRC when configure_connector() fails from Bharata
  - Wire up sys_userfaultfd()
  - Fix race condition in tearing down MSI interrupts from Paul
  - Fix unbalanced pci_dev_get() in cxl_probe() from Daniel
  - Fix cxl build failure due to -Wunused-variable gcc behaviour change from Ian
  - Tell the toolchain to use ABI v2 when building an LE boot wrapper from Benh
  - Fix THP to recompute hash value after a failed update from Aneesh
  - 32-bit memcpy/memset: only use dcbz once cache is enabled from Christophe
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV+h5EAAoJEFHr6jzI4aWADboP/3jc6vUtOmFZ1GBGwXrftOc0
 PGtkTtYnGbaNsp+ZBl39Y+CFsSfhFVaDgylm/8G5NMKZaSBVdcFJwfU1w6Ymn49R
 nZkAT5PC9KgT5RTuRTZ3DO/Y2RC9vg2T9pXjEn8NGYcV8GgUkc3dZAn48S3AFgnV
 4jQI5sbxvwU12XkCUn+DkETh13g3gLYtRxwehBu/S/ovED5iNHKJwnXRzxyAA969
 dARNriSeyLBVMLamJ+rJB1S5hVTZTMbughFVVFbgriyIGuC/C1g9b9GN86dCGS6w
 T6VrKveK/iVCLUB16KV8+inbfvUrXItOxhGJWPHw9uAJGLZTz5G+yLHRRPX8onyC
 pgDesJDDpP/P7sAnKto3tF1Vzi7lwVtVPC1dT1Fc9VAWJGPYC/d16EKGIpNqqlnc
 mAIJ7wcI5c/HxvqXR2rdRV6fMer+aY7utwMsh4o/gDs7ArQUcuCrOKSW0jvHGmyr
 MumARXnUGDwPnGD8IfYI2vDOvwisv9g6XACwsM+pi499SfowaiLuK3utsMccagGZ
 INFtaqS7gpcj8TTu3kymw1TbKW/tqG9T81RRZ0rFH1q3aSfvwQ9QvsXctSeqOl/n
 lkxR3Mk0CT0oupXNKV6pjqsRwLroA0AF5tGxuw4ap1Rt4i8G7WHaPgRp7SPZxtkR
 qVBryfK5bKqYtWp4ztVK
 =Dgn5
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix 32-bit TCE table init in kdump kernel from Nish

 - Fix kdump with non-power-of-2 crashkernel= from Nish

 - Abort cxl_pci_enable_device_hook() if PCI channel is offline from
   Andrew

 - Fix to release DRC when configure_connector() fails from Bharata

 - Wire up sys_userfaultfd()

 - Fix race condition in tearing down MSI interrupts from Paul

 - Fix unbalanced pci_dev_get() in cxl_probe() from Daniel

 - Fix cxl build failure due to -Wunused-variable gcc behaviour change
   from Ian

 - Tell the toolchain to use ABI v2 when building an LE boot wrapper
   from Benh

 - Fix THP to recompute hash value after a failed update from Aneesh

 - 32-bit memcpy/memset: only use dcbz once cache is enabled from
   Christophe

* tag 'powerpc-4.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc32: memset: only use dcbz once cache is enabled
  powerpc32: memcpy: only use dcbz once cache is enabled
  powerpc/mm: Recompute hash value after a failed update
  powerpc/boot: Specify ABI v2 when building an LE boot wrapper
  cxl: Fix build failure due to -Wunused-variable behaviour change
  cxl: Fix unbalanced pci_dev_get in cxl_probe
  powerpc/MSI: Fix race condition in tearing down MSI interrupts
  powerpc: Wire up sys_userfaultfd()
  powerpc/pseries: Release DRC when configure_connector fails
  cxl: abort cxl_pci_enable_device_hook() if PCI channel is offline
  powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=
  powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernel
2015-09-18 08:01:06 -07:00
Aneesh Kumar K.V
36b35d5d80 powerpc/mm: Recompute hash value after a failed update
If we had secondary hash flag set, we ended up modifying hash value in
the updatepp code path. Hence with a failed updatepp we will be using
a wrong hash value for the following hash insert. Fix this by
recomputing hash before insert.

Without this patch we can end up with using wrong slot number in linux
pte. That can result in us missing an hash pte update or invalidate
which can cause memory corruption or even machine check.

Fixes: 6d492ecc64 ("powerpc/THP: Add code to handle HPTE faults for hugepages")
Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-09-16 22:06:03 +10:00
Linus Torvalds
12f03ee606 libnvdimm for 4.3:
1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.  This facility is used by the pmem driver to
    enable pfn_to_page() operations on the page frames returned by DAX
    ('direct_access' in 'struct block_device_operations'). For now, the
    'memmap' allocation for these "device" pages comes from "System
    RAM".  Support for allocating the memmap from device memory will
    arrive in a later kernel.
 
 2/ Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt().  memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects.  The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.  Completion of
    the conversion is targeted for v4.4.
 
 3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.
 
 4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.
 
 5/ Miscellaneous updates and fixes to libnvdimm including support
    for issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5
 JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R
 OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C
 nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ
 NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e
 zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR
 1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA
 sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb
 bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1
 o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz
 dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn
 slsw6DkrWT60CRE42nbK
 =o57/
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This update has successfully completed a 0day-kbuild run and has
  appeared in a linux-next release.  The changes outside of the typical
  drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
  removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
  the introduction of ZONE_DEVICE + devm_memremap_pages().

  Summary:

   - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
     mechanism for adding device-driver-discovered memory regions to the
     kernel's direct map.

     This facility is used by the pmem driver to enable pfn_to_page()
     operations on the page frames returned by DAX ('direct_access' in
     'struct block_device_operations').

     For now, the 'memmap' allocation for these "device" pages comes
     from "System RAM".  Support for allocating the memmap from device
     memory will arrive in a later kernel.

   - Introduce memremap() to replace usages of ioremap_cache() and
     ioremap_wt().  memremap() drops the __iomem annotation for these
     mappings to memory that do not have i/o side effects.  The
     replacement of ioremap_cache() with memremap() is limited to the
     pmem driver to ease merging the api change in v4.3.

     Completion of the conversion is targeted for v4.4.

   - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
     driver, update the VFS DAX implementation and PMEM api to provide
     persistence guarantees for kernel operations on a DAX mapping.

   - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
     cacheable to improve performance.

   - Miscellaneous updates and fixes to libnvdimm including support for
     issuing "address range scrub" commands, clarifying the optimal
     'sector size' of pmem devices, a clarification of the usage of the
     ACPI '_STA' (status) property for DIMM devices, and other minor
     fixes"

* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
  libnvdimm, pmem: direct map legacy pmem by default
  libnvdimm, pmem: 'struct page' for pmem
  libnvdimm, pfn: 'struct page' provider infrastructure
  x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
  add devm_memremap_pages
  mm: ZONE_DEVICE for "device memory"
  mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
  dax: drop size parameter to ->direct_access()
  nd_blk: change aperture mapping from WC to WB
  nvdimm: change to use generic kvfree()
  pmem, dax: have direct_access use __pmem annotation
  dax: update I/O path to do proper PMEM flushing
  pmem: add copy_from_iter_pmem() and clear_pmem()
  pmem, x86: clean up conditional pmem includes
  pmem: remove layer when calling arch_has_wmb_pmem()
  pmem, x86: move x86 PMEM API to new pmem.h header
  libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  pmem: switch to devm_ allocations
  devres: add devm_memremap
  libnvdimm, btt: write and validate parent_uuid
  ...
2015-09-08 14:35:59 -07:00
Dan Williams
033fbae988 mm: ZONE_DEVICE for "device memory"
While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.

Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory.  Device memory has
different lifetime and performance characteristics than RAM.  However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-27 19:40:58 -04:00
Michael Ellerman
9698351565 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next
Freescale updates from Scott:

"Highlights include 32-bit memcpy/memset optimizations, checksum
optimizations, 85xx config fragments and updates, device tree updates,
e6500 fixes for non-SMP, and misc cleanup and minor fixes."
2015-08-27 20:13:12 +10:00
Nikunj A Dadhania
1d805440a3 powerpc/numa: initialize distance lookup table from drconf path
In some situations, a NUMA guest that supports
ibm,dynamic-memory-reconfiguration node will end up having flat NUMA
distances between nodes. This is because of two problems in the
current code.

1) Different representations of associativity lists.

   There is an assumption about the associativity list in
   initialize_distance_lookup_table(). Associativity list has two forms:

   a) [cpu,memory]@x/ibm,associativity has following
      format:
           <N> <N integers>

   b) ibm,dynamic-reconfiguration-memory/ibm,associativity-lookup-arrays

           <M> <N> <M associativity lists each having N integers>
           M = the number of associativity lists
           N = the number of entries per associativity list

   Fix initialize_distance_lookup_table() so that it does not assume
   "case a". And update the caller to skip the length field before
   sending the associativity list.

2) Distance table not getting updated from drconf path.

   Node distance table will not get initialized in certain cases as
   ibm,dynamic-reconfiguration-memory path does not initialize the
   lookup table.

   Call initialize_distance_lookup_table() from drconf path with
   appropriate associativity list.

Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-18 19:34:41 +10:00
Michael Ellerman
73b341efda powerpc/mm: Drop CONFIG_PPC_HAS_HASH_64K
The relation between CONFIG_PPC_HAS_HASH_64K and CONFIG_PPC_64K_PAGES is
painfully complicated.

But if we rearrange it enough we can see that PPC_HAS_HASH_64K
essentially depends on PPC_STD_MMU_64 && PPC_64K_PAGES.

We can then notice that PPC_HAS_HASH_64K is used in files that are only
built for PPC_STD_MMU_64, meaning it's equivalent to PPC_64K_PAGES.

So replace all uses and drop it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2015-08-18 19:32:10 +10:00
Michael Ellerman
f444f1f898 powerpc/cell: Drop support for 64K local store on 4K kernels
Back in the olden days we added support for using 64K pages to map the
SPU (Synergistic Processing Unit) local store on Cell, when the main
kernel was using 4K pages.

This was useful at the time because distros were using 4K pages, but
using 64K pages on the SPUs could reduce TLB pressure there.

However these days the number of Cell users is approaching zero, and
supporting this option adds unpleasant complexity to the memory
management code.

So drop the option, CONFIG_SPU_FS_64K_LS, and all related code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
2015-08-18 19:29:49 +10:00
Kevin Hao
69399ee9cb powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes
It makes no sense to put the instructions for calculating the lock
value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
loop. And when the lock is acquired by the other thread, the current
lock value has no chance to equal with the lock value used by current
cpu. So we can skip the comparing for these two lock values in the
lbz/bne loop.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
2015-08-17 18:53:47 -05:00
Anshuman Khandual
79d0be7407 powerpc/slb: Add documentation on runtime patching of SLB encoding
This patch adds some documentation to patch_slb_encoding() explaining
how it works.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Update change log and mention the signedness of the immediate]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-12 15:04:52 +10:00
Anshuman Khandual
2be682af48 powerpc/slb: Rename all the 'slot' occurrences to 'entry'
The SLB code uses 'slot' and 'entry' interchangeably, change it to always
use 'entry'.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-12 14:50:12 +10:00