Pull cpu hotplug updates from Thomas Gleixner:
"This is the first part of the ongoing cpu hotplug rework:
- Initial implementation of the state machine
- Runs all online and prepare down callbacks on the plugged cpu and
not on some random processor
- Replaces busy loop waiting with completions
- Adds tracepoints so the states can be followed"
More detailed commentary on this work from an earlier email:
"What's wrong with the current cpu hotplug infrastructure?
- Asymmetry
The hotplug notifier mechanism is asymmetric versus the bringup and
teardown. This is mostly caused by the notifier mechanism.
- Largely undocumented dependencies
While some notifiers use explicitely defined notifier priorities,
we have quite some notifiers which use numerical priorities to
express dependencies without any documentation why.
- Control processor driven
Most of the bringup/teardown of a cpu is driven by a control
processor. While it is understandable, that preperatory steps,
like idle thread creation, memory allocation for and initialization
of essential facilities needs to be done before a cpu can boot,
there is no reason why everything else must run on a control
processor. Before this patch series, bringup looks like this:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
bring the rest up
- All or nothing approach
There is no way to do partial bringups. That's something which is
really desired because we waste e.g. at boot substantial amount of
time just busy waiting that the cpu comes to life. That's stupid
as we could very well do preparatory steps and the initial IPI for
other cpus and then go back and do the necessary low level
synchronization with the freshly booted cpu.
- Minimal debuggability
Due to the notifier based design, it's impossible to switch between
two stages of the bringup/teardown back and forth in order to test
the correctness. So in many hotplug notifiers the cancel
mechanisms are either not existant or completely untested.
- Notifier [un]registering is tedious
To [un]register notifiers we need to protect against hotplug at
every callsite. There is no mechanism that bringup/teardown
callbacks are issued on the online cpus, so every caller needs to
do it itself. That also includes error rollback.
What's the new design?
The base of the new design is a symmetric state machine, where both
the control processor and the booting/dying cpu execute a well
defined set of states. Each state is symmetric in the end, except
for some well defined exceptions, and the bringup/teardown can be
stopped and reversed at almost all states.
So the bringup of a cpu will look like this in the future:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
bring itself up
The synchronization step does not require the control cpu to wait.
That mechanism can be done asynchronously via a worker or some
other mechanism.
The teardown can be made very similar, so that the dying cpu cleans
up and brings itself down. Cleanups which need to be done after
the cpu is gone, can be scheduled asynchronously as well.
There is a long way to this, as we need to refactor the notion when a
cpu is available. Today we set the cpu online right after it comes
out of the low level bringup, which is not really correct.
The proper mechanism is to set it to available, i.e. cpu local
threads, like softirqd, hotplug thread etc. can be scheduled on that
cpu, and once it finished all booting steps, it's set to online, so
general workloads can be scheduled on it. The reverse happens on
teardown. First thing to do is to forbid scheduling of general
workloads, then teardown all the per cpu resources and finally shut it
off completely.
This patch series implements the basic infrastructure for this at the
core level. This includes the following:
- Basic state machine implementation with well defined states, so
ordering and prioritization can be expressed.
- Interfaces to [un]register state callbacks
This invokes the bringup/teardown callback on all online cpus with
the proper protection in place and [un]installs the callbacks in
the state machine array.
For callbacks which have no particular ordering requirement we have
a dynamic state space, so that drivers don't have to register an
explicit hotplug state.
If a callback fails, the code automatically does a rollback to the
previous state.
- Sysfs interface to drive the state machine to a particular step.
This is only partially functional today. Full functionality and
therefor testability will be achieved once we converted all
existing hotplug notifiers over to the new scheme.
- Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
processor:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
wait for boot
bring itself up
Signal completion to control cpu
In a previous step of this work we've done a full tree mechanical
conversion of all hotplug notifiers to the new scheme. The balance
is a net removal of about 4000 lines of code.
This is not included in this series, as we decided to take a
different approach. Instead of mechanically converting everything
over, we will do a proper overhaul of the usage sites one by one so
they nicely fit into the symmetric callback scheme.
I decided to do that after I looked at the ugliness of some of the
converted sites and figured out that their hotplug mechanism is
completely buggered anyway. So there is no point to do a
mechanical conversion first as we need to go through the usage
sites one by one again in order to achieve a full symmetric and
testable behaviour"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
cpu/hotplug: Document states better
cpu/hotplug: Fix smpboot thread ordering
cpu/hotplug: Remove redundant state check
cpu/hotplug: Plug death reporting race
rcu: Make CPU_DYING_IDLE an explicit call
cpu/hotplug: Make wait for dead cpu completion based
cpu/hotplug: Let upcoming cpu bring itself fully up
arch/hotplug: Call into idle with a proper state
cpu/hotplug: Move online calls to hotplugged cpu
cpu/hotplug: Create hotplug threads
cpu/hotplug: Split out the state walk into functions
cpu/hotplug: Unpark smpboot threads from the state machine
cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
cpu/hotplug: Implement setup/removal interface
cpu/hotplug: Make target state writeable
cpu/hotplug: Add sysfs state interface
cpu/hotplug: Hand in target state to _cpu_up/down
cpu/hotplug: Convert the hotplugged cpu work to a state machine
cpu/hotplug: Convert to a state machine for the control processor
cpu/hotplug: Add tracepoints
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- Make schedstats a runtime tunable (disabled by default) and
optimize it via static keys.
As most distributions enable CONFIG_SCHEDSTATS=y due to its
instrumentation value, this is a nice performance enhancement.
(Mel Gorman)
- Implement 'simple waitqueues' (swait): these are just pure
waitqueues without any of the more complex features of full-blown
waitqueues (callbacks, wake flags, wake keys, etc.). Simple
waitqueues have less memory overhead and are faster.
Use simple waitqueues in the RCU code (in 4 different places) and
for handling KVM vCPU wakeups.
(Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
Marcelo Tosatti)
- sched/numa enhancements (Rik van Riel)
- NOHZ performance enhancements (Rik van Riel)
- Various sched/deadline enhancements (Steven Rostedt)
- Various fixes (Peter Zijlstra)
- ... and a number of other fixes, cleanups and smaller enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
sched/cputime: Fix steal_account_process_tick() to always return jiffies
sched/deadline: Remove dl_new from struct sched_dl_entity
Revert "kbuild: Add option to turn incompatible pointer check into error"
sched/deadline: Remove superfluous call to switched_to_dl()
sched/debug: Fix preempt_disable_ip recording for preempt_disable()
sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
time, acct: Drop irq save & restore from __acct_update_integrals()
acct, time: Change indentation in __acct_update_integrals()
sched, time: Remove non-power-of-two divides from __acct_update_integrals()
sched/rt: Kick RT bandwidth timer immediately on start up
sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
sched/debug: Move sched_domain_sysctl to debug.c
sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
sched/rt: Fix PI handling vs. sched_setscheduler()
sched/core: Remove duplicated sched_group_set_shares() prototype
sched/fair: Consolidate nohz CPU load update code
sched/fair: Avoid using decay_load_missed() with a negative value
sched/deadline: Always calculate end of period on sched_yield()
sched/cgroup: Fix cgroup entity load tracking tear-down
rcu: Use simple wait queues where possible in rcutree
...
Pull locking changes from Ingo Molnar:
"Various updates:
- Futex scalability improvements: remove page lock use for shared
futex get_futex_key(), which speeds up 'perf bench futex hash'
benchmarks by over 40% on a 60-core Westmere. This makes anon-mem
shared futexes perform close to private futexes. (Mel Gorman)
- lockdep hash collision detection and fix (Alfredo Alvarez
Fernandez)
- lockdep testing enhancements (Alfredo Alvarez Fernandez)
- robustify lockdep init by using hlists (Andrew Morton, Andrey
Ryabinin)
- mutex and csd_lock micro-optimizations (Davidlohr Bueso)
- small x86 barriers tweaks (Michael S Tsirkin)
- qspinlock updates (Waiman Long)"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/csd_lock: Use smp_cond_acquire() in csd_lock_wait()
locking/csd_lock: Explicitly inline csd_lock*() helpers
futex: Replace barrier() in unqueue_me() with READ_ONCE()
locking/lockdep: Detect chain_key collisions
locking/lockdep: Prevent chain_key collisions
tools/lib/lockdep: Fix link creation warning
tools/lib/lockdep: Add tests for AA and ABBA locking
tools/lib/lockdep: Add userspace version of READ_ONCE()
tools/lib/lockdep: Fix the build on recent kernels
locking/qspinlock: Move __ARCH_SPIN_LOCK_UNLOCKED to qspinlock_types.h
locking/mutex: Allow next waiter lockless wakeup
locking/pvqspinlock: Enable slowpath locking count tracking
locking/qspinlock: Use smp_cond_acquire() in pending code
locking/pvqspinlock: Move lock stealing count tracking code into pv_queued_spin_steal_lock()
locking/mcs: Fix mcs_spin_lock() ordering
futex: Remove requirement for lock_page() in get_futex_key()
futex: Rename barrier references in ordering guarantees
locking/atomics: Update comment about READ_ONCE() and structures
locking/lockdep: Eliminate lockdep_init()
locking/lockdep: Convert hash tables to hlists
...
Pull ram resource handling changes from Ingo Molnar:
"Core kernel resource handling changes to support NVDIMM error
injection.
This tree introduces a new I/O resource type, IORESOURCE_SYSTEM_RAM,
for System RAM while keeping the current IORESOURCE_MEM type bit set
for all memory-mapped ranges (including System RAM) for backward
compatibility.
With this resource flag it no longer takes a strcmp() loop through the
resource tree to find "System RAM" resources.
The new resource type is then used to extend ACPI/APEI error injection
facility to also support NVDIMM"
* 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ACPI/EINJ: Allow memory error injection to NVDIMM
resource: Kill walk_iomem_res()
x86/kexec: Remove walk_iomem_res() call with GART type
x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search
resource: Add walk_iomem_res_desc()
memremap: Change region_intersects() to take @flags and @desc
arm/samsung: Change s3c_pm_run_res() to use System RAM type
resource: Change walk_system_ram() to use System RAM type
drivers: Initialize resource entry to zero
xen, mm: Set IORESOURCE_SYSTEM_RAM to System RAM
kexec: Set IORESOURCE_SYSTEM_RAM for System RAM
arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM
ia64: Set System RAM type and descriptor
x86/e820: Set System RAM type and descriptor
resource: Add I/O resource descriptor
resource: Handle resource flags properly
resource: Add System RAM resource type
is a bit larger because the surrounding code needed a cleanup, but
nothing worrisome.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJW4UwZAAoJEL/70l94x66DG3YH/0PfUr4sW0jnWRVXmYlPVka4
sNFYrdtYnx08PwXu2sWMm1F+OBXlF/t0ZSJXJ9OBF8WdKIu8TU4yBOINRAvGO/oE
slrivjktLTKgicTtIXP5BpRR14ohwHIGcuiIlppxvnhmQz1/rMtig7fvhZxYI545
lJyIbyquNR86tiVdUSG9/T9+ulXXXCvOspYv8jPXZx7VKBXKTvp5P5qavSqciRb+
O9RqY+GDCR/5vrw+MV0J7H9ZydeEJeD02LcWguTGMATTm0RCrhydvSbou42UcKfY
osWii0kwt2LhcM/sTOz+cWnLJ6gwU9T+ZtJTTbLvYWXWDLP/+icp9ACMkwNciNo=
=/y4V
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"A few simple fixes for ARM, x86, PPC and generic code.
The x86 MMU fix is a bit larger because the surrounding code needed a
cleanup, but nothing worrisome"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0
KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
kvm: cap halt polling at exactly halt_poll_ns
KVM: s390: correct fprs on SIGP (STOP AND) STORE STATUS
KVM: VMX: disable PEBS before a guest entry
KVM: PPC: Book3S HV: Sanitize special-purpose register values on guest exit
The fork of a process with four page table levels is broken since
git commit 6252d702c5 "[S390] dynamic page tables."
All new mm contexts are created with three page table levels and
an asce limit of 4TB. If the parent has four levels dup_mmap will
add vmas to the new context which are outside of the asce limit.
The subsequent call to copy_page_range will walk the three level
page table structure of the new process with non-zero pgd and pud
indexes. This leads to memory clobbers as the pgd_index *and* the
pud_index is added to the mm->pgd pointer without a pgd_deref
in between.
The init_new_context() function is selecting the number of page
table levels for a new context. The function is used by mm_init()
which in turn is called by dup_mm() and mm_alloc(). These two are
used by fork() and exec(). The init_new_context() function can
distinguish the two cases by looking at mm->context.asce_limit,
for fork() the mm struct has been copied and the number of page
table levels may not change. For exec() the mm_alloc() function
set the new mm structure to zero, in this case a three-level page
table is created as the temporary stack space is located at
STACK_TOP_MAX = 4TB.
This fixes CVE-2016-2143.
Reported-by: Marcin Kościelnicki <koriakin@0x04.net>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With MACHINE_HAS_VX, we convert the floating point registers from the
vector registeres when storing the status. For other VCPUs, these are
stored to vcpu->run->s.regs.vrs, but we are using current->thread.fpu.vxrs,
which resolves to the currently loaded VCPU.
So kvm_s390_store_status_unloaded() currently writes the wrong floating
point registers (converted from the vector registers) when called from
another VCPU on a z13.
This is only the case for old user space not handling SIGP STORE STATUS and
SIGP STOP AND STORE STATUS, but relying on the kernel implementation. All
other calls come from the loaded VCPU via kvm_s390_store_status().
Fixes: 9abc2a08a7 (KVM: s390: fix memory overwrites when vx is disabled)
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
we have to check bit 40 of the facility list before issuing LPP
and not bit 48. Otherwise a guest running on a system with
"The decimal-floating-point zoned-conversion facility" and without
the "The set-program-parameters facility" might crash on an lpp
instruction.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
Fixes: e22cf8ca6f ("s390/cpumf: rework program parameter setting to detect guest samples")
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Let the non boot cpus call into idle with the corresponding hotplug state, so
the hotplug core can handle the further bringup. That's a first step to
convert the boot side of the hotplugged cpus to do all the synchronization
with the other side through the state machine. For now it'll only start the
hotplug thread and kick the full bringup of the cpu.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.614102639@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The problem:
On -rt, an emulated LAPIC timer instances has the following path:
1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled
This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.
The solution:
Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.
Normal waitqueues make use of spinlocks, which on -RT
are sleepable locks. Therefore, waking up a waitqueue
waiter involves locking a sleeping lock, which
is not allowed from hard interrupt context.
cyclictest command line:
This patch reduces the average latency in my tests from 14us to 11us.
Daniel writes:
Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
benchmark on mainline. The test was run 1000 times on
tip/sched/core 4.4.0-rc8-01134-g0905f04:
./x86-run x86/tscdeadline_latency.flat -cpu host
with idle=poll.
The test seems not to deliver really stable numbers though most of
them are smaller. Paolo write:
"Anything above ~10000 cycles means that the host went to C1 or
lower---the number means more or less nothing in that case.
The mean shows an improvement indeed."
Before:
min max mean std
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 5162.596000 2019270.084000 5824.491541 20681.645558
std 75.431231 622607.723969 89.575700 6492.272062
min 4466.000000 23928.000000 5537.926500 585.864966
25% 5163.000000 1613252.750000 5790.132275 16683.745433
50% 5175.000000 2281919.000000 5834.654000 23151.990026
75% 5190.000000 2382865.750000 5861.412950 24148.206168
max 5228.000000 4175158.000000 6254.827300 46481.048691
After
min max mean std
count 1000.000000 1000.00000 1000.000000 1000.000000
mean 5143.511000 2076886.10300 5813.312474 21207.357565
std 77.668322 610413.09583 86.541500 6331.915127
min 4427.000000 25103.00000 5529.756600 559.187707
25% 5148.000000 1691272.75000 5784.889825 17473.518244
50% 5160.000000 2308328.50000 5832.025000 23464.837068
75% 5172.000000 2393037.75000 5853.177675 24223.969976
max 5222.000000 3922458.00000 6186.720500 42520.379830
[Patch was originaly based on the swait implementation found in the -rt
tree. Daniel ported it to mainline's version and gathered the
benchmark numbers for tscdeadline_latency test.]
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
git commit 904818e2f2
"s390/kernel: introduce fpu-internal.h with fpu helper functions"
introduced the fpregs_store / fp_regs_load helper. These function
fail to save and restore the floating pointer control registers.
The effect is that the FPC is not correctly handled on signal
delivery and signal return.
Cc: stable@vger.kernel.org # 4.4
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
git commit 8070361799
"s390: add support for vector extension"
broke 31-bit compat processes in regard to signal handling.
The restore_sigregs_ext32() function is used to restore the additional
elements from the user space signal frame. Among the additional elements
are the upper registers halves for 64-bit register support for 31-bit
processes. The copy_from_user that is used to retrieve the high-gprs
array from the user stack uses an incorrect length, 8 bytes instead of
64 bytes. This causes incorrect upper register halves to get loaded.
Cc: stable@vger.kernel.org # 3.8+
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull s390 fixes from Martin Schwidefsky:
"Several bug fixes:
- There are four different stack tracers, and three of them have
bugs. For 4.5 the bugs are fixed and we prepare a cleanup patch
for the next merge window.
- Three bug fixes for the dasd driver in regard to parallel access
volumes and the new max_dev_sectors block device queue limit
- The irq restore optimization needs a fixup for memcpy_real
- The diagnose trace code has a conflict with lockdep"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/dasd: fix performance drop
s390/maccess: reduce stnsm instructions
s390/diag: avoid lockdep recursion
s390/dasd: fix refcount for PAV reassignment
s390/dasd: prevent incorrect length error under z/VM after PAV changes
s390: fix DAT off memory access, e.g. on kdump
s390/oprofile: fix address range for asynchronous stack
s390/perf_event: fix address range for asynchronous stack
s390/stacktrace: add save_stack_trace_regs()
s390/stacktrace: save full stack traces
s390/stacktrace: add missing end marker
s390/stacktrace: fix address ranges for asynchronous and panic stack
s390/stacktrace: fix save_stack_trace_tsk() for current task
Pull livepatching fixes from Jiri Kosina:
- regression (from 4.4) fix for ordering issue, introduced by an
earlier ftrace change, that broke live patching of modules.
The fix replaces the ftrace module notifier by direct call in order
to make the ordering guaranteed and well-defined. The patch, from
Jessica Yu, has been acked both by Steven and Rusty
- error message fix from Miroslav Benes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
ftrace/module: remove ftrace module notifier
livepatch: change the error message in asm/livepatch.h header files
When fixing the DAT off bug ("s390: fix DAT off memory access, e.g.
on kdump") both Christian and I missed that we can save an additional
stnsm instruction.
This saves us a couple of cycles which could improve the speed of
memcpy_real.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The diagnose tracer will indirectly call back into the lockdep code
when lockdep does not expect it (arch_spinlock). This causes lockdep
to disable itself and therefore we don't have a working lock
dependency validator anymore.
This patch effectively disables tracing of diag 0x9c and 0x44 if
lockdep is enabled. If however lockdep is enabled spinlocks are
mainly implemented using a trylock variant, which will not issue any
diag 0x9c or 0x44. So this change has hardly any effect on tracing
except when arch_spinlock and friends are explicitly used.
Reported-and-Tested-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
commit 204ee2c564 ("s390/irqflags: optimize irq restore") optimized
irqrestore to really only care about interrupts and adapted the
remaining low level users. One spot (memcpy_real) was not touched,
though - fix it. Otherwise a kdump kernel will fail while reading
the old kernel. As we re-enable irqs with a non-standard function
we have to tell lockdep about that.
Fixes: 204ee2c564 ("s390/irqflags: optimize irq restore")
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
git commit dc7ee00d47 ("s390: lowcore stack pointer offsets")
introduced a regression in regard to s390_backtrace(). The stack
pointer for the asynchronous stack in the lowcore now has an
additional offset applied. This offset needs to be taken into account
in the calculation for the low and high address for the stack.
This bug was already partially fixed with commit 9cc5c206d9
("s390/dumpstack: fix address ranges for asynchronous and panic
stack"). This patch fixes it also for the oprofile code.
Fixes: dc7ee00d47 ("s390: lowcore stack pointer offsets")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
git commit dc7ee00d47 ("s390: lowcore stack pointer offsets")
introduced a regression in regard to perf_callchain_kernel(). The
stack pointer for the asynchronous stack in the lowcore now has an
additional offset applied. This offset needs to be taken into account
in the calculation for the low and high address for the stack.
This bug was already partially fixed with 9cc5c206d9
("s390/dumpstack: fix address ranges for asynchronous and panic
stack"). This patch fixes it also for the perf_event code.
Fixes: dc7ee00d47 ("s390: lowcore stack pointer offsets")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Implement save_stack_trace_regs, so that a stack trace of a kprobe
event can be obtained.
Without this we see following warning:
"save_stack_trace_regs() not implemented yet."
when we execute:
echo stacktrace > /sys/kernel/debug/tracing/trace_options
echo "p kfree" >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
Reported-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Pratyush Anand <panand@redhat.com>
[heiko.carstens@de.ibm.com]: changed patch to use __save_stack_trace()
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
save_stack_trace() only saves the stack trace of the current context
(interrupt or process context). This is different to what other
architectures like x86 do, which save the full stack trace across
different contexts.
Also extract a __save_stack_trace() helper function which will be used
by a follow on patch.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
save_stack_trace() did not write the ULONG_MAX end marker if there is
enough space left. So simply follow x86 and arm64.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
git commit dc7ee00d47 ("s390: lowcore stack pointer offsets")
introduced a regression in regard to save_stack_trace(). The stack
pointer for the asynchronous and the panic stack in the lowcore now
have an additional offset applied to them. This offset needs to be
taken into account in the calculation for the low and high address for
the stacks.
This bug was already partially fixed with 9cc5c206d9
("s390/dumpstack: fix address ranges for asynchronous and panic
stack"). This patch fixes it also for the stacktrace code.
Fixes: dc7ee00d47 ("s390: lowcore stack pointer offsets")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The function save_stack_trace_tsk() did not consider that it can be
used for tsk == current, for which the current stack pointer obviously
cannot be found in the thread structure.
Fix this and get the stack pointer with an inline assembly.
This fixes e.g. the output of "cat /proc/self/stack".
Before:
[<0000000000000000>] (null)
[<ffffffffffffffff>] 0xffffffffffffffff
After:
[<000000000011b3ee>] save_stack_trace_tsk+0x56/0x98
[<0000000000366cde>] proc_pid_stack+0xae/0x108
[<00000000003636f0>] proc_single_show+0x70/0xc0
[<0000000000311fbc>] seq_read+0xcc/0x448
[<00000000002e7716>] __vfs_read+0x36/0x100
[<00000000002e872e>] vfs_read+0x76/0x130
[<00000000002e975e>] SyS_read+0x66/0xd8
[<000000000089490e>] system_call+0xd6/0x264
[<ffffffffffffffff>] 0xffffffffffffffff
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Lockdep is initialized at compile time now. Get rid of lockdep_init().
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Krinkin <krinkin.m.u@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: mm-commits@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Set IORESOURCE_SYSTEM_RAM in flags of resource ranges with
"System RAM", "Kernel code", "Kernel data", and "Kernel bss".
Note that:
- IORESOURCE_SYSRAM (i.e. modifier bit) is set in flags when
IORESOURCE_MEM is already set. IORESOURCE_SYSTEM_RAM is defined
as (IORESOURCE_MEM|IORESOURCE_SYSRAM).
- Some archs do not set 'flags' for children nodes, such as
"Kernel code". This patch does not change 'flags' in this
case.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@linux-mips.org
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/1453841853-11383-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull s390 updates from Martin Schwidefsky:
"An optimization for irq-restore, the SSM instruction is quite a bit
slower than an if-statement and a STOSM.
The copy_file_range system all is added.
Cleanup for PCI and CIO.
And a couple of bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cio: update measurement characteristics
s390/cio: ensure consistent measurement state
s390/cio: fix measurement characteristics memleak
s390/zcrypt: Fix cryptographic device id in kernel messages
s390/pci: remove iomap sanity checks
s390/pci: set error state for unusable functions
s390/pci: fix bar check
s390/pci: resize iomap
s390/pci: improve ZPCI_* macros
s390/pci: provide ZPCI_ADDR macro
s390/pci: adjust IOMAP_MAX_ENTRIES
s390/numa: move numa_init_late() from device to arch_initcall
s390: remove all usages of PSW_ADDR_INSN
s390: remove all usages of PSW_ADDR_AMODE
s390: wire up copy_file_range syscall
s390: remove superfluous memblock_alloc() return value checks
s390/numa: allocate memory with correct alignment
s390/irqflags: optimize irq restore
s390/mm: use TASK_MAX_SIZE where applicable
The kernel now always uses vector registers when available, however KVM
has special logic if support is really enabled for a guest. If support
is disabled, guest_fpregs.fregs will only contain memory for the fpu.
The kernel, however, will store vector registers into that area,
resulting in crazy memory overwrites.
Simply extending that area is not enough, because the format of the
registers also changes. We would have to do additional conversions, making
the code even more complex. Therefore let's directly use one place for
the vector/fpu registers + fpc (in kvm_run). We just have to convert the
data properly when accessing it. This makes current code much easier.
Please note that vector/fpu registers are now always stored to
vcpu->run->s.regs.vrs. Although this data is visible to QEMU and
used for migration, we only guarantee valid values to user space when
KVM_SYNC_VRS is set. As that is only the case when we have vector
register support, we are on the safe side.
Fixes: b5510d9b68 ("s390/fpu: always enable the vector facility if it is available")
Cc: stable@vger.kernel.org # v4.4 d9a3a09af5 s390/kvm: remove dependency on struct save_area definition
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[adopt to d9a3a09af5]
The KVM-VFIO device is used by the QEMU VFIO device. It is used to
record the list of in-use VFIO groups so that KVM can manipulate
them.
While we don't need this on s390 currently, let's try to be like
everyone else.
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
fprs is never freed, therefore resulting in a memory leak if
kvm_vcpu_init() fails or the vcpu is destroyed.
Fixes: 9977e886cb ("s390/kernel: lazy restore fpu registers")
Cc: stable@vger.kernel.org # v4.3+
Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Eric Farman <farman@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Since each iomap_entry handles only one bar of one pci function
(even when disjunct ranges of a bar are mapped) the sanity check
in pci_iomap_range is not needed and can be removed.
Also convert the remaining BUG_ONs to WARN_ONs.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We receive special notifications from firmware when an error was detected
and a pci function became unusable. Set the error_state accordingly to give
device drivers a hint that they don't need to try error recovery.
Suggested-by: Alexander Schmidt <alexschm@de.ibm.com>
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Fix the check which bar space we should map to allow available bars only.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On s390 we need to maintain a mapping between iomem addresses
and arch specific function identifiers. Currently the mapping
table is created as such that we could span the whole iomem
address space. Since we can only map each bar space from each
possible function we have an upper bound for the number of
mapping entries.
This reduces the size of the iomap from 256K to less than 4K
(using the defconfig).
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Most of the constants defined in pci_io.h depend on each other
and thus can be calculated.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Provide and use a ZPCI_ADDR macro as the complement of ZPCI_IDX
to get rid of some constants in the code.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
ZPCI_IOMAP_MAX_ENTRIES is off by one. Let's adjust this
for the sake of correctness.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Commit 3e89e1c5ea ("hugetlb: make mm and fs code explicitly non-modular")
moves hugetlb_init() from module_init to subsys_initcall.
The hugetlb_init()->hugetlb_register_node() code accesses "node->dev.kobj"
which is initialized in numa_init_late().
Since numa_init_late() is a device_initcall which is called *after*
subsys_initcall the above mentioned patch breaks NUMA on s390.
So fix this and move numa_init_late() to arch_initcall.
Fixes: 3e89e1c5ea ("hugetlb: make mm and fs code explicitly non-modular")
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move the generic implementation to <linux/dma-mapping.h> now that all
architectures support it and remove the HAVE_DMA_ATTR Kconfig symbol now
that everyone supports them.
[valentinrothberg@gmail.com: remove leftovers in Kconfig]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Helge Deller <deller@gmx.de>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yet another leftover from the 31 bit era. The usual operation
"y = x & PSW_ADDR_INSN" with the PSW_ADDR_INSN mask is a nop for
CONFIG_64BIT.
Therefore remove all usages and hope the code is a bit less confusing.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
This is a leftover from the 31 bit area. For CONFIG_64BIT the usual
operation "y = x | PSW_ADDR_AMODE" is a nop. Therefore remove all
usages of PSW_ADDR_AMODE and make the code a bit less confusing.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
memblock_alloc() and memblock_alloc_base() will panic on their own if
they can't find free memory. Therefore remove some pointless checks.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Allocating memory with a requested minimum alignment of 1 is wrong
since pg_data_t contains a spinlock which requires an alignment of 4
bytes.
Therefore fix this and ask for an alignment of 8 bytes like it is
guarenteed for all kmalloc requests.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The ssm instruction takes longer that stnsm/stosm as it is often
used to modify DAT and PER. We know that irqsave/irqrestore only
deals with external and I/O interrupts and we know that irqrestore
can transition only from disabled->disabled or disabled->enabled,
so we can use the faster stosm.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>