linux

Author	SHA1	Message	Date
Christophe Leroy	fd893fe56a	powerpc/mm: Fix missing page attributes in page table dump On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used. There is also _PAGE_SHARED that is missing. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-27 22:20:27 +10:00
David Gibson	9765ad134a	powerpc/mm: Ensure IRQs are off in switch_mm() powerpc expects IRQs to already be (soft) disabled when switch_mm() is called, as made clear in the commit message of `9c1e105238` ("powerpc: Allow perf_counters to access user memory at interrupt time"). Aside from any race conditions that might exist between switch_mm() and an IRQ, there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't followed at some point by an IRQ enable then interrupts will remain disabled until we return to userspace. It is true that when switch_mm() is called from the scheduler IRQs are off, but not when it's called by use_mm(). Looking closer we see that last year in commit `f98db6013c` ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") this was made more explicit by the addition of switch_mm_irqs_off() which is now called by the scheduler, vs switch_mm() which is used by use_mm(). Arguably it is a bug in use_mm() to call switch_mm() in a different context than it expects, but fixing that will take time. This was discovered recently when vhost started throwing warnings such as: BUG: sleeping function called from invalid context at kernel/mutex.c:578 in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760 no locks held by vhost-10760/10768. irq event stamp: 10 hardirqs last enabled at (9): _raw_spin_unlock_irq+0x40/0x80 hardirqs last disabled at (10): switch_slb+0x2e4/0x490 softirqs last enabled at (0): copy_process+0x5e8/0x1260 softirqs last disabled at (0): (null) Call Trace: show_stack+0x88/0x390 (unreliable) dump_stack+0x30/0x44 __might_sleep+0x1c4/0x2d0 mutex_lock_nested+0x74/0x5c0 cgroup_attach_task_all+0x5c/0x180 vhost_attach_cgroups_work+0x58/0x80 [vhost] vhost_worker+0x24c/0x3d0 [vhost] kthread+0xec/0x100 ret_from_kernel_thread+0x5c/0xd4 Prior to commit `04b96e5528` ("vhost: lockless enqueuing") (Aug 2016) the vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(), which had the effect of reenabling IRQs. Since that commit removed the locking in vhost_worker() the body of the vhost_worker() loop now runs with interrupts off causing the warnings. This patch addresses the problem by making the powerpc code mirror the x86 code, ie. we disable interrupts in switch_mm(), and optimise the scheduler case by defining switch_mm_irqs_off(). Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [mpe: Flesh out/rewrite change log, add stable] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-25 00:24:59 +10:00
Michael Ellerman	9fc849144c	Merge branch 'topic/kprobes' into next Although most of these kprobes patches are powerpc specific, there's a couple that touch generic code (with Acks). At the moment there's one conflict with acme's tree, but it's not too bad. Still just in case some other conflicts show up, we've put these in a topic branch so another tree could merge some or all of it if necessary.	2017-04-25 00:24:04 +10:00
Naveen N. Rao	1b32cd1715	powerpc: Introduce a new helper to obtain function entry points kprobe_lookup_name() is specific to the kprobe subsystem and may not always return the function entry point (in a subsequent patch for KPROBES_ON_FTRACE). For looking up function entry points, introduce a separate helper and use it in optprobes.c Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-24 19:07:58 +10:00
Naveen N. Rao	ead514d5fb	powerpc/kprobes: Add support for KPROBES_ON_FTRACE Allow kprobes to be placed on ftrace _mcount() call sites. This optimization avoids the use of a trap, by riding on ftrace infrastructure. This depends on HAVE_DYNAMIC_FTRACE_WITH_REGS which depends on MPROFILE_KERNEL, which is only currently enabled on powerpc64le with newer toolchains. Based on the x86 code by Masami. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-24 19:07:58 +10:00
Naveen N. Rao	9a914aa682	powerpc/kprobes: Blacklist common exception handlers Blacklist all the exception common/OOL handlers as the kernel stack is not yet setup, which means we can't take a trap at this point. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-23 20:32:26 +10:00
Naveen N. Rao	7aa5b018bf	powerpc/kprobes: Blacklist exception handlers Introduce __head_end to mark end of the early fixed sections and use it to blacklist all exception handlers from kprobes. mpe: We do not need to do anything special for relocatable kernels, where the exception vectors are split from the main kernel, as the split vectors are already excluded by the check for kernel_text_address(). Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> [mpe: Move __head_end outside #ifdef 64-bit to unbreak the 32-bit build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-23 20:32:25 +10:00
Nicholas Piggin	9cba253df4	powerpc/64s: Simplify POWER9 DD1 idle workaround code The idle workaround does not need to load PACATOC, and it does not need to be called within a nested function that requires LR to be saved. Load the PACATOC at entry to the idle wakeup. It does not matter which PACA this comes from, so it's okay to call before the workaround. Then apply the workaround to get the right PACA. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-23 20:32:23 +10:00
Nicholas Piggin	0d7720a242	powerpc/64s: Idle POWER8 avoid full state loss recovery where possible If not all threads were in winkle, full state loss recovery is not necessary and can be avoided. A previous patch removed this optimisation due to some complexity with the implementation. Re-implement it by counting the number of threads in winkle with the per-core idle state. Only restore full state loss if all threads were in winkle. This has a small window of false positives right before threads execute winkle and just after they wake up, when the winkle count does not reflect the true number of threads in winkle. This is not a significant problem in comparison with even the minimum winkle duration. For correctness, a false positive is not a problem (only false negatives would be). Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-23 20:32:12 +10:00
Nicholas Piggin	adbcf8d74f	powerpc/64s: Expand core idle state bits In preparation for adding more bits to the core idle state word, move the lock bit up, and unlock by flipping the lock bit rather than masking off all but the thread bits. Add branch hints for atomic operations while we're here. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-23 20:31:49 +10:00
Nicholas Piggin	1945bc4549	powerpc/64s: Fix POWER9 machine check handler from stop state The ISA specifies power save wakeup due to a machine check exception can cause a machine check interrupt (rather than the usual system reset interrupt). The machine check handler copes with this by doing low level machine check recovery without restoring full state from idle, then queues up a machine check event for logging, then directly executes the same idle instruction it woke from. This minimises the work done before recovery is performed. The problem is that it requires machine specific instructions and knowledge of the book3s idle code. Currently it only has code to handle POWER8 idle, so POWER9 crashes when trying to execute the P8 idle instructions which don't exist in ISAv3.0B. cpu 0x0: Vector: e40 (Emulation Assist) at [c0000000008f3810] pc: c000000000008380: machine_check_handle_early+0x130/0x2f0 lr: c00000000053a098: stop_loop+0x68/0xd0 sp: c0000000008f3a90 msr: 9000000000081001 current = 0xc0000000008a1080 paca = 0xc00000000ffd0000 softe: 0 irq_happened: 0x01 pid = 0, comm = swapper/0 Instead of going to sleep after recovery, do the usual idle wakeup and state restoration by calling into the normal idle wakeup path. This reuses the normal idle wakeup paths. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-23 20:31:46 +10:00
Nicholas Piggin	544686cae8	powerpc/64s: Stop using bit in HSPRG0 to test winkle The POWER8 idle code has a neat trick of programming the power on engine to restore a low bit into HSPRG0, so idle wakeup code can test and see if it has been programmed this way and therefore lost all state. Restore time can be reduced if winkle has not been reached. However this messes with our r13 PACA pointer, and requires HSPRG0 to be written to. It also optimizes the slowest and most uncommon case at the expense of another SPR write in the common nap state wakeup. Remove this complexity and assume winkle sleeps always require a state restore. This speedup could be made entirely contained within the winkle idle code by counting per-core winkles and setting a thread bitmap when all have gone to winkle. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-23 20:31:39 +10:00
Nicholas Piggin	2563a70c3b	powerpc/64s: Remove unnecessary relocation branch from idle handler The system reset idle handler system_reset_idle_common is relocated, so relocation is not required to branch to kvm_start_guest. The superfluous relocation does not result in incorrect code, but it does not compile outside of exception-64s.S (with fixed section definitions). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-23 17:26:35 +10:00
Oliver O'Halloran	f855b2f544	powerpc/mm: Wire up ioremap_cache() The default implementation of ioremap_cache() is aliased to ioremap(). On powerpc ioremap() creates cache-inhibited mappings by default which is almost certainly not what you wanted. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-21 21:08:47 +10:00
Naveen N. Rao	49e0b4658f	kprobes: Convert kprobe_lookup_name() to a function The macro is now pretty long and ugly on powerpc. In the light of further changes needed here, convert it to a __weak variant to be over-ridden with a nicer looking function. Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-20 23:18:54 +10:00
Nicholas Piggin	a050d20d02	powerpc/64s: Use relon prolog for EXC_VIRT_OOL_MASKABLE_HV handlers Hypervisor Virtualization and Directed Hypervisor Doorbell interrupt handlers use the macro EXC_VIRT_OOL_MASKABLE_HV for their relocation-on handlers, which calls MASKABLE_RELON_EXCEPTION_HV_OOL, which uses the real mode interrupt prolog. This means we needlessly rfid from virtual mode to virtual mode. For POWER8 it only affects doorbell IPIs. Context switch microbenchmark between threads with snooze disabled (which causes IPI) gets about 3% faster, about 370 cycles. Should be more important on POWER9 with global doorbells and HVI for host interrupts. Use the RELON variant instead to reduce overhead. Fixes: `1707dd1613` ("powerpc: Save CFAR before branching in interrupt entry paths") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Fold some more detail into the change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-20 17:05:11 +10:00
Nicholas Piggin	ca80d5d0a8	powerpc/64s: Remove SAO feature from Power9 DD1 Power9 DD1 does not implement SAO. Although it's not widely used, its presence or absence is visible to user space via arch_validate_prot() so it's moderately important that we get the value right. Fixes: `7dccfbc325` ("powerpc/book3s: Add a cpu table entry for different POWER9 revs") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-19 20:48:25 +10:00
Nicholas Piggin	2384d2d7ad	powerpc/64s: Remove ICSWX feature from Power9 Power9 does not implement the icswx instruction. This CPU feature is not visible to userspace and is only used in the CONFIG_PPC_ICSWX code, which is generally not enabled, and can only be triggered by other code using icswx, which should not happen on Power9 systems in the first place. So impact should be minimal. Fixes: `c3ab300ea5` ("powerpc: Add POWER9 cputable entry") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-19 20:21:50 +10:00
Madhavan Srinivasan	170a315f41	powerpc/perf: Support to export MMCRA[TEC*] field to userspace Threshold feature when used with MMCRA [Threshold Event Counter Event], MMCRA[Threshold Start event] and MMCRA[Threshold End event] will update MMCRA[Threashold Event Counter Exponent] and MMCRA[Threshold Event Counter Multiplier] with the corresponding threshold event count values. Patch to export MMCRA[TECX/TECM] to userspace in 'weight' field of struct perf_sample_data. Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-19 20:00:22 +10:00
Madhavan Srinivasan	79e96f8f93	powerpc/perf: Export memory hierarchy info to user space The LDST field and DATA_SRC in SIER identifies the memory hierarchy level (eg: L1, L2 etc), from which a data-cache miss for a marked instruction was satisfied. Use the 'perf_mem_data_src' object to export this hierarchy level to user space. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-19 20:00:21 +10:00
Michael Ellerman	590c369e7e	powerpc: Drop include of linux/io.h from asm/io.h Currently powerpc's asm/io.h includes linux/io.h, and linux/io.h includes asm/io.h. This can cause problems because depending on which is included first the order of definitions between the two files will change. The include of linux/io.h was added back in 2008 in commit `b41e5fffe8` ("[POWERPC] devres: Add devm_ioremap_prot()"). It's not entirely clear it was needed then, but devm_ioremap_prot() has since been removed entirely as unused, in `dedd24a12f` ("powerpc: Remove unused devm_ioremap_prot()"). So it seems to be unnecessary and can potentially cause problems, so remove the include of linux/io.h from asm/io.h Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-13 23:34:35 +10:00
Nicholas Piggin	6b3edefefa	powerpc/powernv: POWER9 support for msgsnd/doorbell IPI POWER9 requires msgsync for receiver-side synchronization, and a DD1 workaround restricts IPIs to core-local. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Drop no longer needed asm feature macro changes] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-13 23:34:34 +10:00
Nicholas Piggin	a5adf28246	powerpc/64s: Avoid a branch for ppc_msgsnd IPIs are a pretty hot path and we already have the ability to do asm feature patching, so use it. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Change log detail] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-13 23:34:34 +10:00
Nicholas Piggin	b87ac02183	powerpc: Introduce msgsnd/doorbell barrier primitives POWER9 changes requirements and adds new instructions for synchronization. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-13 23:34:33 +10:00
Nicholas Piggin	b866cc2199	powerpc: Change the doorbell IPI calling convention Change the doorbell callers to know about their msgsnd addressing, rather than have them set a per-cpu target data tag at boot that gets sent to the cause_ipi functions. The data is only used for doorbell IPI functions, no other IPI types, so it makes sense to keep that detail local to doorbell. Have the platform code understand doorbell IPIs, rather than the interrupt controller code understand them. Platform code can look at capabilities it has available and decide which to use. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-13 23:34:33 +10:00
Nicholas Piggin	9b7ff0c658	powerpc/64s: Add SCV FSCR bit for ISA v3.0 Add the bit definition and use it in facility_unavailable_exception() so we can intelligently report the cause if we take a fault for SCV. This doesn't actually enable SCV. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Drop whitespace changes to the existing entries, flush out change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-13 23:34:32 +10:00
Balbir Singh	9c355917fc	powerpc/tracing: Allow tracing of mmap syscalls Currently sys_mmap() and sys_mmap2() (32-bit only), are not visible to the syscall tracing machinery. This means users are not able to see the execution of mmap() syscalls using the syscall tracer. Fix that by using SYSCALL_DEFINE6 for sys_mmap() and sys_mmap2() so that the meta-data associated with these syscalls is visible to the syscall tracer. A side-effect of this change is that the return type has changed from unsigned long to long. However this should have no effect, the only code in the kernel which uses the result of these syscalls is in the syscall return path, which is written in asm and treats the result as unsigned regardless. Example output: cat-3399 [001] .... 196.542410: sys_mmap(addr: 7fff922a0000, len: 20000, prot: 3, flags: 812, fd: 3, offset: 1b0000) cat-3399 [001] .... 196.542443: sys_mmap -> 0x7fff922a0000 cat-3399 [001] .... 196.542668: sys_munmap(addr: 7fff922c0000, len: 6d2c) cat-3399 [001] .... 196.542677: sys_munmap -> 0x0 Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Massage change log, add detail on return type change] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-12 22:32:43 +10:00
Michael Ellerman	03dfee6d5f	powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages Recently in commit `f6eedbba7a` ("powerpc/mm/hash: Increase VA range to 128TB"), we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong. The end result is that the PGD (Page Global Directory, ie top level page table) of the kernel (aka. swapper_pg_dir), is too small. This generally doesn't lead to a crash, as we don't use the full range in normal operation. However if we try to dump the kernel pagetables we can trigger a crash because we walk off the end of the pgd into other memory and eventually try to dereference something bogus: $ cat /sys/kernel/debug/kernel_pagetables Unable to handle kernel paging request for data at address 0xe8fece0000000000 Faulting instruction address: 0xc000000000072314 cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890] pc: c000000000072314: ptdump_show+0x164/0x430 lr: c000000000072550: ptdump_show+0x3a0/0x430 dar: e802cf0000000000 seq_read+0xf8/0x560 full_proxy_read+0x84/0xc0 __vfs_read+0x6c/0x1d0 vfs_read+0xbc/0x1b0 SyS_read+0x6c/0x110 system_call+0x38/0xfc The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move the calculation into asm-offsets.c where we can do it easily using max(). Fixes: `f6eedbba7a` ("powerpc/mm/hash: Increase VA range to 128TB") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-12 22:32:43 +10:00
Michael Ellerman	3c19d5ada1	Merge branch 'topic/xive' (early part) into next This merges the arch part of the XIVE support, leaving the final commit with the KVM specific pieces dangling on the branch for Paul to merge via the kvm-ppc tree.	2017-04-12 22:31:37 +10:00
Gautham R. Shenoy	17ed4c8f81	powerpc/powernv: Recover correct PACA on wakeup from a stop on P9 DD1 POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up from stop 0,1,2 with ESL=1 can endup being misplaced in the core. Thus the HSPRG0 of a thread waking up from can contain the paca pointer of its sibling. This patch implements a context recovery framework within threads of a core, by provisioning space in paca_struct for saving every sibling threads's paca pointers. Basically, we should be able to arrive at the right paca pointer from any of the thread's existing paca pointer. At bootup, during powernv idle-init, we save the paca address of every CPU in each one its siblings paca_struct in the slot corresponding to this CPU's index in the core. On wakeup from a stop, the thread will determine its index in the core from the TIR register and recover its PACA pointer by indexing into the correct slot in the provisioned space in the current PACA. Furthermore, ensure that the NVGPRs are restored from the stack on the way out by setting the NAPSTATELOST in paca. [Changelog written with inputs from svaidy@linux.vnet.ibm.com] Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Call it a bug] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-11 08:45:09 +10:00
Gautham R. Shenoy	a7cd88da97	powerpc/powernv: Move CPU-Offline idle state invocation from smp.c to idle.c Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which transitions the CPU to the deepest available platform idle state to a new function named pnv_cpu_offline() in powernv/idle.c. The rationale behind this code movement is that the data required to determine the deepest available platform state resides in powernv/idle.c. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-11 08:45:09 +10:00
Anshuman Khandual	2c9faa7675	powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes Add user space exported API definitions for 512KB, 1MB, 2MB, 8MB, 16MB, 1GB, 16GB non default huge page sizes to be used with mmap() system call. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> [mpe: Reword the comment to emphasise that these are only needed to use the non-default huge page size, and updated the change log.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-11 07:46:06 +10:00
Michael Ellerman	7644d5819c	powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there powerpc_debugfs_root is the dentry representing the root of the "powerpc" directory tree in debugfs. Currently it sits in asm/debug.h, a long with some other things that have "debug" in the name, but are otherwise unrelated. Pull it out into a separate header, which also includes linux/debugfs.h, and convert all the users to include debugfs.h instead of debug.h. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-11 07:46:03 +10:00
Benjamin Herrenschmidt	08a1e650cc	powerpc: Fixup LPCR:PECE and HEIC setting on POWER9 We need to set LPES in order for normal external interrupts (0x500) to be directed to the guest while running in guest state. We also need HEIC set to prevent them to be sent to the host while in host state. With XIVE the host never gets one of these and wouldn't know how to handle it. All host external interrupts come in via the new hypervisor virtualization interrupts vector. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-10 21:43:17 +10:00
Benjamin Herrenschmidt	d381d7caf8	powerpc: Consolidate variants of real-mode MMIOs We have all sort of variants of MMIO accessors for the real mode instructions. This creates a clean set of accessors based on Linux normal naming conventions, replacing all occurrences of the old ones in the tree. I have purposefully removed the "out/in" variants in favor of only including __raw variants. Any code using these is already pretty much hand tuned to operate in a very specific environment. I've fixed up the 2 users (only one of them actually needed a barrier in the first place). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-10 21:43:16 +10:00
Benjamin Herrenschmidt	f50d6bd344	powerpc/kvm: Remove obsolete kvm_vm_ioctl_xics_irq declaration The function doesn't exist anymore Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-10 21:43:16 +10:00
Benjamin Herrenschmidt	936774cd3f	powerpc/kvm: Make kvmppc_xics_create_icp static It's only used within the same file it's defined Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-10 21:43:15 +10:00
Benjamin Herrenschmidt	243e25112d	powerpc/xive: Native exploitation of the XIVE interrupt controller The XIVE interrupt controller is the new interrupt controller found in POWER9. It supports advanced virtualization capabilities among other things. Currently we use a set of firmware calls that simulate the old "XICS" interrupt controller but this is fairly inefficient. This adds the framework for using XIVE along with a native backend which OPAL for configuration. Later, a backend allowing the use in a KVM or PowerVM guest will also be provided. This disables some fast path for interrupts in KVM when XIVE is enabled as these rely on the firmware emulation code which is no longer available when the XIVE is used natively by Linux. A latter patch will make KVM also directly exploit the XIVE, thus recovering the lost performance (and more). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Fixup pr_xxx("XIVE:"...), don't split pr_xxx() strings, tweak Kconfig so XIVE_NATIVE selects XIVE and depends on POWERNV, fix build errors when SMP=n, fold in fixes from Ben: Don't call cpu_online() on an invalid CPU number Fix irq target selection returning out of bounds cpu# Extra sanity checks on cpu numbers ] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-10 21:41:34 +10:00
Benjamin Herrenschmidt	a978e13965	powerpc/smp: Remove migrate_irq() custom implementation Some powerpc platforms use this to move IRQs away from a CPU being unplugged. This function has several bugs such as not taking the right locks or failing to NULL check pointers. There's a new generic function doing exactly the same thing without all the bugs, so let's use it instead. mpe: The obvious place for the select of GENERIC_IRQ_MIGRATION is on HOTPLUG_CPU, but that doesn't work. On some configs PM_SLEEP_SMP will select HOTPLUG_CPU even though its dependencies are not met, which means the select of GENERIC_IRQ_MIGRATION doesn't happen. That leads to the build breaking. Fix it by moving the select of GENERIC_IRQ_MIGRATION to SMP. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-07 12:01:27 +10:00
Benjamin Herrenschmidt	14d4ae5c4c	powerpc: Add optional smp_ops->prepare_cpu SMP callback Some platforms (will) need to perform allocations before bringing a new CPU online. Doing it from smp_ops->setup_cpu is the wrong thing to do: - It has no useful failure path (too late) - Calling any allocator will enable interrupts prematurely causing problems with large decrementer among others Instead, add a new callback that is called from __cpu_up (so from the context trying to online the new CPU) at a point where we can safely allocate and handle failures. This will be used by XIVE support. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-06 19:58:53 +10:00
Benjamin Herrenschmidt	22bd64a621	powerpc: Add more PPC bit conversion macros Add 32 and 8 bit variants Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-06 19:58:53 +10:00
Benjamin Herrenschmidt	eeea1a434d	powerpc/powernv: Add XIVE related definitions to opal-api.h Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-06 19:58:46 +10:00
Alistair Popple	1ab66d1fba	powerpc/powernv: Introduce address translation services for Nvlink2 Nvlink2 supports address translation services (ATS) allowing devices to request address translations from an mmu known as the nest MMU which is setup to walk the CPU page tables. To access this functionality certain firmware calls are required to setup and manage hardware context tables in the nvlink processing unit (NPU). The NPU also manages forwarding of TLB invalidates (known as address translation shootdowns/ATSDs) to attached devices. This patch exports several methods to allow device drivers to register a process id (PASID/PID) in the hardware tables and to receive notification of when a device should stop issuing address translation requests (ATRs). It also adds a fault handler to allow device drivers to demand fault pages in. Signed-off-by: Alistair Popple <alistair@popple.id.au> [mpe: Fix up comment formatting, use flush_tlb_mm()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-04 13:27:26 +10:00
Michael Ellerman	63f44d6514	powerpc/book3s: Print task info if we take a machine check in user mode For an MCE (Machine Check Exception) that hits while in user mode MSR(PR=1), print the task info to the console MCE error log. This may help to identify an application that triggered the MCE. After this patch the MCE console looks like: Severe Machine check interrupt [Recovered] NIP: [0000000010039778] PID: 762 Comm: ebizzy Initiator: CPU Error type: SLB [Multihit] Effective address: 0000000010039778 Severe Machine check interrupt [Not recovered] NIP: [0000000010039778] PID: 763 Comm: ebizzy Initiator: CPU Error type: UE [Page table walk ifetch] Effective address: 0000000010039778 ebizzy[763]: unhandled signal 7 at 0000000010039778 nip 0000000010039778 lr 0000000010001b44 code 30004 Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-03 16:12:00 +10:00
Aneesh Kumar K.V	f4ea6dcb08	powerpc/mm: Enable mappings above 128TB Not all user space application is ready to handle wide addresses. It's known that at least some JIT compilers use higher bits in pointers to encode their information. It collides with valid pointers with 512TB addresses and leads to crashes. To mitigate this, we are not going to allocate virtual address space above 128TB by default. But userspace can ask for allocation from full address space by specifying hint address (with or without MAP_FIXED) above 128TB. If hint address set above 128TB, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in full address space, rather than from 128TB window. This approach helps to easily make application's memory allocator aware about large address space without manually tracking allocated virtual address space. This is going to be a per mmap decision. ie, we can have some mmaps with larger addresses and other that do not. A sample memory layout looks like: 10000000-10010000 r-xp 00000000 fc:00 9057045 /home/max_addr_512TB 10010000-10020000 r--p 00000000 fc:00 9057045 /home/max_addr_512TB 10020000-10030000 rw-p 00010000 fc:00 9057045 /home/max_addr_512TB 10029630000-10029660000 rw-p 00000000 00:00 0 [heap] 7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0 7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83690000-7fff836a0000 rw-p 00000000 00:00 0 7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0 [vdso] 7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0 [stack] 1000000000000-1000000010000 rw-p 00000000 00:00 0 1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-01 21:12:29 +11:00
Aneesh Kumar K.V	82228e362f	powerpc/pseries: Skip using reserved virtual address range Now that we use all the available virtual address range, we need to make sure we don't generate VSID such that it overlaps with the reserved vsid range. Reserved vsid range include the virtual address range used by the adjunct partition and also the VRMA virtual segment. We find the context value that can result in generating such a VSID and reserve it early in boot. We don't look at the adjunct range, because for now we disable the adjunct usage in a Linux LPAR via CAS interface. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-01 21:12:27 +11:00
Aneesh Kumar K.V	bb1832217a	powerpc/mm/hash: Store addr_limit in PACA We optmize the slice page size array copy to paca by copying only the range based on addr_limit. This will require us to not look at page size array beyond addr_limit in PACA on slb fault. To enable that copy task size to paca which will be used during slb fault. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rename from task_size to addr_limit, consolidate #ifdefs] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-01 21:12:27 +11:00
Aneesh Kumar K.V	957b778a16	powerpc/mm: Add addr_limit to mm_context and use it to derive max slice index In the followup patch, we will increase the slice array size to handle 512TB range, but will limit the max addr to 128TB. Avoid doing unnecessary computation and avoid doing slice mask related operation above address limit. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-04-01 21:12:20 +11:00
Aneesh Kumar K.V	f6eedbba7a	powerpc/mm/hash: Increase VA range to 128TB We update the hash linux page table layout such that we can support 512TB. But we limit the TASK_SIZE to 128TB. We can switch to 128TB by default without conditional because that is the max virtual address supported by other architectures. We will later add a mechanism to on-demand increase the application's effective address range to 512TB. Having the page table layout changed to accommodate 512TB makes testing large memory configuration easier with less code changes to kernel Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-03-31 23:10:01 +11:00
Aneesh Kumar K.V	59248aecc4	powerpc/mm/hash: Convert mask to unsigned long This doesn't have any functional change. But helps in avoiding mistakes in case the shift bit changes Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2017-03-31 23:10:00 +11:00

1 2 3 4 5 ...

3803 Commits