vdso per cpu area allocation in smp_prepare_cpus() happens with GFP_KERNEL
but irqs disabled. Triggers this one:
Badness at kernel/lockdep.c:2280
Modules linked in:
CPU: 0 Not tainted 2.6.30 #2
Process swapper (pid: 1, task: 000000003fe88000, ksp: 000000003fe87eb8)
Krnl PSW : 0400c00180000000 0000000000083360 (lockdep_trace_alloc+0xec/0xf8)
[...]
Call Trace:
([<00000000000832b6>] lockdep_trace_alloc+0x42/0xf8)
[<00000000000b1880>] __alloc_pages_internal+0x3e8/0x5c4
[<00000000000b1b4a>] __get_free_pages+0x3a/0xb0
[<0000000000026546>] vdso_alloc_per_cpu+0x6a/0x18c
[<00000000005eff82>] smp_prepare_cpus+0x322/0x594
[<00000000005e8232>] kernel_init+0x76/0x398
[<000000000001bb1e>] kernel_thread_starter+0x6/0xc
[<000000000001bb18>] kernel_thread_starter+0x0/0xc
Fix this by moving the allocation out of the irqs disabled section.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Replace the spinlock used in the idle time accounting with a sequence
counter mechanism analog to seqlock.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch introduces the hibernation backend support to the
s390 architecture. Now it is possible to suspend a mainframe Linux
guest using the following command:
echo disk > /sys/power/state
Signed-off-by: Hans-Joachim Picht <hans@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
All definition in cpu.h have to do with cputime accounting. Move
them to cputime.h and remove the header file.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The cpu_possible_map by default is initialized with all ones in s390.
If the kernel paramert possible_cpus=<x> is passed the cpu_possible_map
is supposed to have x bits set.
However the current code just sets the x bits without clearing the NR_CPUS
bits that were already set. So we end up with an unchanged map that has
all bits set.
To fix this just clear the map before setting any new bits.
This broke with def6cfb70b
"[S390] cpumask: Use accessors code."
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Currently the storage of the machine flags is a globally exported unsigned
long long variable. By moving the storage location into the lowcore struct we
allow assembler code to check machine_flags directly even without needing a
register. Addtionally the lowcore and therefore the machine flags too will be
in cache most of the time.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Impact: use new API
Use the accessors rather than frobbing bits directly. Most of this is
in arch code I haven't even compiled, but is straightforward.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Impact: cleanup, futureproof
In fact, all cpumask ops will only be valid (in general) for bit
numbers < nr_cpu_ids. So use that instead of NR_CPUS in various
places (I also updated the immediate sites to use the new cpumask_
operators).
This is always safe: no cpu number can be >= nr_cpu_ids, and
nr_cpu_ids is initialized to NR_CPUS at boot.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Performing an initial cpu reset makes sure all registers and tlbs of
the targeted cpu are initialized and flushed.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If sigp_set_prefix fails on __cpu_up we leak the lowcore structures
and async+panic stacks for the failed cpu.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The zcore code switches to real addressing mode when creating a kernel dump.
This is not possible, if it is built as a kernel module. With this patch
zcore (zfcpdump) can't be built as a kernel module any more.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With lockdep we got the following trace after a panic:
Badness at /home/autobuild/BUILD/linux-2.6.28-20090204/kernel/lockdep.c:2878
[...]
Call Trace:
[<0000000000176334>] lock_acquire+0x54/0xbc
[<000000000050b4fe>] __atomic_notifier_call_chain+0x6e/0xdc
[<000000000050b59c>] atomic_notifier_call_chain+0x30/0x44
[<0000000000504274>] panic+0xd0/0x1e8
[...]
INFO: lockdep is turned off.
Last Breaking-Event-Address:
[<0000000000170e62>] check_flags+0xae/0x15c
possible reason: unannotated irqs-off.
lockdep is right. We missed a trace_hardirq_off in our smp_send_stop
function and smp_send_stop is called before the panic call chain.
Reported-by: Mijo <Safradin mijo@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Provide new shutdown action "dump_reipl" for automatic ipl after dump.
Signed-off-by: Frank Munzert <munzert@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
!CONFIG_SMP:
arch/s390/kernel/vdso.c: In function 'vdso_init':
arch/s390/kernel/vdso.c:325: error: incompatible type for argument 2 of 'vdso_alloc_per_cpu'
Also move the code out of the BUG_ON statement since it won't be
executed on !CONFIG_BUG. And that would be a bug.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
x86: export vector_used_by_percpu_irq
x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
sched: nominate preferred wakeup cpu, fix
x86: fix lguest used_vectors breakage, -v2
x86: fix warning in arch/x86/kernel/io_apic.c
sched: fix warning in kernel/sched.c
sched: move test_sd_parent() to an SMP section of sched.h
sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
sched: framework for sched_mc/smt_power_savings=N
sched: convert BALANCE_FOR_xx_POWER to inline functions
x86: use possible_cpus=NUM to extend the possible cpus allowed
x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
x86: update io_apic.c to the new cpumask code
x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
x86: xen: use smp_call_function_many()
x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
...
Fixed up trivial conflict in kernel/time/tick-sched.c manually
The extract cpu time instruction (ectg) instruction allows the user
process to get the current thread cputime without calling into the
kernel. The code that uses the instruction needs to switch to the
access registers mode to get access to the per-cpu info page that
contains the two base values that are needed to calculate the current
cputime from the CPU timer with the ectg instruction.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Increase the precision of the idle time calculation that is exported
to user space via /sys/devices/system/cpu/cpu<x>/idle_time_us
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On s390 we always want to run with precise cputime accounting.
Remove the config options VIRT_TIMER and VIRT_CPU_ACCOUNTING.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Since etr/stp don't need the old smp_call_function semantics anymore
we can convert s390 to the generic IPI infrastructure.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Impact: cleanup
Each SMP arch defines these themselves. Move them to a central
location.
Twists:
1) Some archs (m32, parisc, s390) set possible_map to all 1, so we add a
CONFIG_INIT_ALL_POSSIBLE for this rather than break them.
2) mips and sparc32 '#define cpu_possible_map phys_cpu_present_map'.
Those archs simply have phys_cpu_present_map replaced everywhere.
3) Alpha defined cpu_possible_map to cpu_present_map; this is tricky
so I just manipulate them both in sync.
4) IA64, cris and m32r have gratuitous 'extern cpumask_t cpu_possible_map'
declarations.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Travis <travis@sgi.com>
Cc: ink@jurassic.park.msu.ru
Cc: rmk@arm.linux.org.uk
Cc: starvik@axis.com
Cc: tony.luck@intel.com
Cc: takata@linux-m32r.org
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: paulus@samba.org
Cc: schwidefsky@de.ibm.com
Cc: lethal@linux-sh.org
Cc: wli@holomorphy.com
Cc: davem@davemloft.net
Cc: jdike@addtoit.com
Cc: mingo@redhat.com
Use sysdev_class_create_file() to create create sysdev class attributes
instead of sysfs_create_file(). Using sysfs_create_file() wasn't a very
good idea since the show and store functions have a different amount of
parameters for sysfs files and sysdev class files.
In particular the pointer to the buffer is the last argument and
therefore accesses to random memory regions happened.
Still worked surprisingly well until we got a kernel panic.
Cc: stable@kernel.org
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Right now, there is no notifier that is called on a new cpu, before the new
cpu begins processing interrupts/softirqs.
Various kernel function would need that notification, e.g. kvm works around
by calling smp_call_function_single(), rcu polls cpu_online_map.
The patch adds a CPU_STARTING notification. It also adds a helper function
that sends the message to all cpu_chain handlers.
Tested on x86-64.
All other archs are untested. Especially on sparc, I'm not sure if I got
it right.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the now unneeded s390_idle.lock spinlock initialization after
Josef Sipek did it the right way in arch/s390/kernel/process.c.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This allow to dynamically generate attributes and share show/store
functions between attributes. Right now most attributes are generated
by special macros and lots of duplicated code. With the attribute
passed it's instead possible to attach some data to the attribute
and then use that in shared low level functions to do different things.
I need this for the dynamically generated bank attributes in the x86
machine check code, but it'll allow some further cleanups.
I converted all users in tree to the new show/store prototype. It's a single
huge patch to avoid unbisectable sections.
Runtime tested: x86-32, x86-64
Compiled only: ia64, powerpc
Not compile tested/only grep converted: sh, arm, avr32
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's never used and the comments refer to nonatomic and retry
interchangably. So get rid of it.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The first argument to __ctl_store() should be the array to store
stuff in, not just the first element of that array. With the
current code in __cpu_up(), mainline GCC dies with an internal
compiler error. I didn't diagnose that further, but just fixed
the kernel bug.
Signed-off-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
This fixes the last remaining section mismatch warnings in s390
architecture code. It reveals also a real bug introduced by... me
with git commit 2069e978d5
("[S390] sparsemem vmemmap: initialize memmap.")
Calling the generic vmemmap_alloc_block() function to get initialized
memory is a nice idea, however that function is __meminit annotated
and therefore the function might be gone if we try to call it later.
This can happen if a DCSS segment gets added.
So basically revert the patch and clear the memmap explicitly to fix
the original bug.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Both smp_call_function() and __smp_call_function_map() access
cpu_online_map. Both functions run with preemption disabled which
protects for cpus going offline. However new cpus can be added and
therefore the cpu_online_map can change unexpectedly.
So use the call_lock to protect against changes to the cpu_online_map
in start_secondary() and all smp_call_* functions.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On some smp sysfs store attributes get_online_cpus() may block on
cpu_hotplug.lock, but we hold already smp_cpu_state_mutex. Since the
locking order on cpu hotplug via arch_update_cpu_topology is inverse
this might lead to deadlocks.
So make sure locking order is always the same.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Most noteable part of this commit is the new local header file entry.h
which contains all the function declarations of functions that get only
called from asm code or are arch internal. That way we can avoid extern
declarations in C files.
This is more or less the same that was done for sparc64.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Remove the program check generating monitor calls and use function
calls instead. Theres is no real advantage in using monitor calls,
but they do make debugging harder, because of all the program checks
it generates.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
If vertical cpu polarization is active then the hypervisor will
dispatch certain cpus for a longer time than other cpus for maximum
performance. For example if a guest would have three virtual cpus,
each of them with a share of 33 percent, then in case of vertical
cpu polarization all of the processing time would be combined to a
single cpu which would run all the time, while the other two cpus
would get nearly no cpu time.
There are three different types of vertical cpus: high, medium and
low. Low cpus hardly get any real cpu time, while high cpus get a
full real cpu. Medium cpus get something in between.
In order to switch between the two possible modes (default is
horizontal) a 0 for horizontal polarization or a 1 for vertical
polarization must be written to the dispatching sysfs attribute:
/sys/devices/system/cpu/dispatching
The polarization of each single cpu can be figured out by the
polarization sysfs attribute of each cpu:
/sys/devices/system/cpu/cpuX/polarization
horizontal, vertical:high, vertical:medium, vertical:low or unknown.
When switching polarization the polarization attribute may contain
the value unknown until the configuration change is done and the
kernel has figured out the new polarization of each cpu.
Note that running a system with different types of vertical cpus may
result in significant performance regressions. If possible only one
type of vertical cpus should be used. All other cpus should be
offlined.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Add s390 backend so we can give the scheduler some hints about the
cpu topology.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Compile smp.o with -Wno-nonnull so gcc stops warning about memcpy
being used with a null parameter. Also remove the workaround code
and use a char * cast instead of a void * cast to do computations.
Cc: Bastian Blank <bastian@waldi.eu.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Just copy the first 512 read-only bytes of the current cpu lowcore if
a new cpu gets onlined. The rest is zeroed out and must be explicitly
initialized. Current code just copies the entire lowcore and
initializes the needed fields.
This should reveal bugs in future enhancements quite early.
Also when the lowcore of the first cpu is replaced this is now done
atomically (no interrupts, no machine checks).
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Fix couple of section mismatches. And since we touch the code
anyway change the IPL code to use C99 initializers.
Cc: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make sure func isn't called on the local cpu just like on all other
architectures that implement this function.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Git commit 86ef5c9a8e forgot a few
lock_cpu_hotplug/unlock_cpu_hotplug pairs in arch/s390/kernel/smp.c
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch adds the s390 variant for smp_call_function_mask(). The
implementation is pretty straight forward using the wrapper
__smp_call_function_map() which already takes a cpumask_t argument.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
No need to preallocate the per cpu lowcores and stacks.
Savings are 28-32k per offline cpu.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In case of a kernel panic it is currently possible to specify that a dump
should be created, the system should be rebooted or stopped. Virtual sysfs
files under the directory /sys/firmware/ are used for that configuration.
In addition to that, there are kernel parameters 'vmhalt', 'vmpoff'
and 'vmpanic', which can be used to specify z/VM commands, which are
automatically executed in case of halt, power off or a kernel panic.
This patch combines both functionalities and allows to specify the z/VM CP
commands also via sysfs attributes. In addition to that, it enhances the
existing handling of shutdown triggers (e.g. halt or panic) and associated
shutdown actions (e.g. dump or reipl) and makes it more flexible.
Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
It caused only a lot of confusion. From now on cpu hotplug of up to
NR_CPUS will work by default. If somebody wants to limit that then
the possible_cpus parameter can be used.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add a new interface so that cpus can be put into standby state and
configured state.
Only offline cpus can be put into standby state or configured state.
For that the new percpu sysfs attribute "configure" must be used.
To put a cpu in standby state a "0" must be written to the attribute.
In order to switch it into configured state a "1" must be written to
the attribute.
Only cpus in configured state can be brought online.
In addition this patch introduces a static mapping of physical to
logical cpus. As a result only the sysfs directories of present cpus
will be created. To scan for new cpus the new sysfs attribute "rescan"
must be used.
Writing to /sys/devices/system/cpu/rescan will trigger a rescan of
cpus and will create directories for new cpus.
On IPL only configured cpus will be used. And on reboot/shutdown all
cpus will remain in their current state (configured/standby).
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Don't perform a sigp store-status-at-address on smp_send_stop().
It will overwrite the lowcores of other cpus and destroys valueable
debug informations.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Seems that people prefer to have the unit encoded in the attribute
name. Also makes parsing easier.
Now we have:
# cat /sys/devices/system/cpu/cpu0/idle_time_us
131473592
instead of
# cat /sys/devices/system/cpu/cpu0/idle_time
131473592 us
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The current tlb flushing code for page table entries violates the
s390 architecture in a small detail. The relevant section from the
principles of operation (SA22-7832-02 page 3-47):
"A valid table entry must not be changed while it is attached
to any CPU and may be used for translation by that CPU except to
(1) invalidate the entry by using INVALIDATE PAGE TABLE ENTRY or
INVALIDATE DAT TABLE ENTRY, (2) alter bits 56-63 of a page-table
entry, or (3) make a change by means of a COMPARE AND SWAP AND
PURGE instruction that purges the TLB."
That means if one thread of a multithreaded applciation uses a vma
while another thread does an unmap on it, the page table entries of
that vma needs to get removed with IPTE, IDTE or CSP. In some strange
and rare situations a cpu could check-stop (die) because a entry has
been pushed out of the TLB that is still needed to complete a
(milli-coded) instruction. I've never seen it happen with the current
code on any of the supported machines, so right now this is a
theoretical problem. But I want to fix it nevertheless, to avoid
headaches in the futures.
To get this implemented correctly without changing common code the
primitives ptep_get_and_clear, ptep_get_and_clear_full and
ptep_set_wrprotect need to use the IPTE instruction to invalidate the
pte before the new pte value gets stored. If IPTE is always used for
the three primitives three important operations will have a performace
hit: fork, mprotect and exit_mmap. Time for some workarounds:
* 1: ptep_get_and_clear_full is used in unmap_vmas to remove page
tables entries in a batched tlb gather operation. If the mmu_gather
context passed to unmap_vmas has been started with full_mm_flush==1
or if only one cpu is online or if the only user of a mm_struct is the
current process then the fullmm indication in the mmu_gather context is
set to one. All TLBs for mm_struct are flushed by the tlb_gather_mmu
call. No new TLBs can be created while the unmap is in progress. In
this case ptep_get_and_clear_full clears the ptes with a simple store.
* 2: ptep_get_and_clear is used in change_protection to clear the
ptes from the page tables before they are reentered with the new
access flags. At the end of the update flush_tlb_range clears the
remaining TLBs. In general the ptep_get_and_clear has to issue IPTE
for each pte and flush_tlb_range is a nop. But if there is only one
user of the mm_struct then ptep_get_and_clear uses simple stores
to do the update and flush_tlb_range will flush the TLBs.
* 3: Similar to 2, ptep_set_wrprotect is used in copy_page_range
for a fork to make all ptes of a cow mapping read-only. At the end of
of copy_page_range dup_mmap will flush the TLBs with a call to
flush_tlb_mm. Check for mm->mm_users and if there is only one user
avoid using IPTE in ptep_set_wrprotect and let flush_tlb_mm clear the
TLBs.
Overall for single threaded programs the tlb flush code now performs
better, for multi threaded programs it is slightly worse. In particular
exit_mmap() now does a single IDTE for the mm and then just frees every
page cache reference and every page table page directly without a delay
over the mmu_gather structure.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add two new sysfs entries per cpu: idle_count and idle_time.
idle_count contains the number of times a cpu went into idle state.
idle_time contains the time a cpu spent in idle state in microseconds.
This can be used e.g. by powertop to tell how often idle state is
entered and left.
# cat /sys/devices/system/cpu/cpu0/idle_count
504
# cat /sys/devices/system/cpu/cpu0/idle_time
469734037 us
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
There is no need to disable bottom halves when holding call_lock. Also
this could imply that it is legal to call smp_call_function* from
bh context, which it is not.
Also test if func will be executed locally before disabling
and aterwards enabling interrupts again. It's not necessary to disable
and enable interrupts each time __smp_call_function_map gets called.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
smp_call_function_single now has the same semantics as s390's
smp_call_function_on. Therefore convert to the *single variant
and get rid of some architecture specific code.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Merge smp_count_cpus() and smp_get_save_areas() so we save a loop over
all potentially present cpus.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use the __cpuinit instead of __devinit section annotations for code
that deals with cpu hotplug. In addition add some more annotations on
functions that have been left out so far.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Generate uevents for all cpus if cpu capability changes. This can
happen e.g. because the cpus are overheating. The cpu capability can
be read via /sys/devices/system/cpu/cpuN/capability.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
s390 machines provide hardware support for creating Linux dumps on SCSI
disks. For creating a dump a special purpose dump Linux is used. The first
32 MB of memory are saved by the hardware before the dump Linux is
booted. Via an SCLP interface, the saved memory can be accessed from
Linux. This patch exports memory and registers of the crashed Linux to
userspace via a debugfs file. For more information refer to
Documentation/s390/zfcpdump.txt, which is included in this patch.
Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Avoid sprinkling a _lot_ of preempt_disable/preempt_enable pairs.
This would be necessary for e.g. the iucv driver. Also this way we
are more consistent with other architectures which disable
preemption at least for smp_call_function.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Setup.h has been misused for ipl related stuff in the past. We now move
everything, which has to do with ipl and reipl to a new header file named
"ipl.h".
Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Introduce __smp_call_function_map which calls a function on all cpus
given with a cpumask_t. Use it to implement smp_call_function and
smp_call_function_on. Replace smp_ext_bitcall_others with smp_ext_bitcall
and a for_each_cpu_mask loop. Use a cpumask_t instead of an atomic_t for
cpu counting and print a warning if preempt is on in
__smp_call_function_map().
Signed-off-by: Jan Glauber <jan.glauber@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
smp_call_function and smp_call_function_on share the same lock and
smp_call_function_on disables softirq's so it can be called from
softirq context as well. Hence smp_call_function muss disable
softirqs as well to avoid deadlocks.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch adds support for clock synchronization to an external time
reference (ETR). The external time reference sends an oscillator
signal and a synchronization signal every 2^20 microseconds to keep
the TOD clocks of all connected servers in sync. For availability
two ETR units can be connected to a machine. If the clock deviates
for more than the sync-check tolerance all cpus get a machine check
that indicates that the clock is out of sync. For the lovely details
how to get the clock back in sync see the code below.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This provides a noexec protection on s390 hardware. Our hardware does
not have any bits left in the pte for a hw noexec bit, so this is a
different approach using shadow page tables and a special addressing
mode that allows separate address spaces for code and data.
As a special feature of our "secondary-space" addressing mode, separate
page tables can be specified for the translation of data addresses
(storage operands) and instruction addresses. The shadow page table is
used for the instruction addresses and the standard page table for the
data addresses.
The shadow page table is linked to the standard page table by a pointer
in page->lru.next of the struct page corresponding to the page that
contains the standard page table (since page->private is not really
private with the pte_lock and the page table pages are not in the LRU
list).
Depending on the software bits of a pte, it is either inserted into
both page tables or just into the standard (data) page table. Pages of
a vma that does not have the VM_EXEC bit set get mapped only in the
data address space. Any try to execute code on such a page will cause a
page translation exception. The standard reaction to this is a SIGSEGV
with two exceptions: the two system call opcodes 0x0a77 (sys_sigreturn)
and 0x0aad (sys_rt_sigreturn) are allowed. They are stored by the
kernel to the signal stack frame. Unfortunately, the signal return
mechanism cannot be modified to use an SA_RESTORER because the
exception unwinding code depends on the system call opcode stored
behind the signal stack frame.
This feature requires that user space is executed in secondary-space
mode and the kernel in home-space mode, which means that the addressing
modes need to be switched and that the noexec protection only works
for user space.
After switching the addressing modes, we cannot use the mvcp/mvcs
instructions anymore to copy between kernel and user space. A new
mvcos instruction has been added to the z9 EC/BC hardware which allows
to copy between arbitrary address spaces, but on older hardware the
page tables need to be walked manually.
Signed-off-by: Gerald Schaefer <geraldsc@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
72486f1f8f inverts the logic if an
'online' attribute in /sys/devices/system/cpu/cpuX should appear.
So we end up with no hotpluggable cpus at all...
Set the hotpluggable value to one to make sure the online
attribute appears again.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Let one master cpu kill all other cpus instead of sending an external
interrupt to all other cpus so they can kill themselves.
Simplifies reipl/shutdown functions a lot.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Major cleanup of all s390 inline assemblies. They now have a common
coding style. Quite a few have been shortened, mainly by using register
asm variables. Use of the EX_TABLE macro helps as well. The atomic ops,
bit ops and locking inlines new use the Q-constraint if a newer gcc
is used. That results in slightly better code.
Thanks to Christian Borntraeger for proof reading the changes.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
It is now possible to specify a ccw/fcp dump device which is used to
automatically create a system dump in case of a kernel panic. The dump
device can be configured under /sys/firmware/dump.
In addition it is now possible to specify a ccw/fcp device which is used
for the next reboot of Linux. The reipl device can be configured under
/sys/firmware/reipl.
Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With Goto-san's patch, we can add new pgdat/node at runtime. I'm now
considering node-hot-add with cpu + memory on ACPI.
I found acpi container, which describes node, could evaluate cpu before
memory. This means cpu-hot-add occurs before memory hot add.
In most part, cpu-hot-add doesn't depend on node hot add. But register_cpu(),
which creates symbolic link from node to cpu, requires that node should be
onlined before register_cpu(). When a node is onlined, its pgdat should be
there.
This patch-set holds off creating symbolic link from node to cpu
until node is onlined.
This removes node arguments from register_cpu().
Now, register_cpu() requires 'struct node' as its argument. But the array of
struct node is now unified in driver/base/node.c now (By Goto's node hotplug
patch). We can get struct node in generic way. So, this argument is not
necessary now.
This patch also guarantees add cpu under node only when node is onlined. It
is necessary for node-hot-add vs. cpu-hot-add patch following this.
Moreover, register_cpu calculates cpu->node_id by cpu_to_node() without regard
to its 'struct node *root' argument. This patch removes it.
Also modify callers of register_cpu()/unregister_cpu, whose args are changed
by register-cpu-remove-node-struct patch.
[Brice.Goglin@ens-lyon.org: fix it]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.
We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.
This patch replaces for_each_cpu with for_each_possible_cpu.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Retry starting of new cpu if sigp restart returns condition code 2 (busy).
Signed-off-by: Michael Ryan <ryan@funsoft.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we stop allocating percpu memory for not-possible CPUs we must not touch
the percpu data for not-possible CPUs at all. The correct way of doing this
is to test cpu_possible() or to use for_each_cpu().
This patch is a kernel-wide sweep of all instances of NR_CPUS. I found very
few instances of this bug, if any. But the patch converts lots of open-coded
test to use the preferred helper macros.
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Christian Zankel <chris@zankel.net>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The last changes that introduced the additional_cpus command line parameter
also introduced a regression regarding smp initialization speed. In
smp_setup_cpu_possible_map() cpu_present_map is set to the same value as
cpu_possible_map. Especially that means that bits in the present map will be
set for cpus that are not present. This will cause a slow down in the initial
cpu_up() loop in smp_init() since trying to take cpus online that aren't
present takes a while.
Fix this by setting only bits for present cpus in cpu_present_map and set
cpu_present_map to cpu_possible_map in smp_cpus_done().
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce possible_cpus command line option. Hard sets the number of bits set
in cpu_possible_map. Unlike the additional_cpus parameter this one guarantees
that num_possible_cpus() will stay constant even if the system gets rebooted
and a different number of cpus are present at startup.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce additional_cpus command line option. By default no additional cpu
can be attached to the system anymore. Only the cpus present at IPL time can
be switched on/off. If it is desired that additional cpus can be attached to
the system the maximum number of additional cpus needs to be specified with
this option.
This change is necessary in order to limit the waste of per_cpu data
structures.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initiliazing of cpu_possible_map was done in smp_prepare_cpus which is way too
late. Therefore assign a static value to cpu_possible_map, since we don't
have access to max_cpus in setup_arch.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sanitize some s390 Kconfig options. We have ARCH_S390, ARCH_S390X,
ARCH_S390_31, 64BIT, S390_SUPPORT and COMPAT. Replace these 6 options by
S390, 64BIT and COMPAT.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh Dickins <hugh@veritas.com>
Fix the broken atomic_cmpxchg primitive. Add atomic_sub_and_test,
atomic64_sub_return, atomic64_sub_and_test, atomic64_cmpxchg,
atomic64_add_unless and atomic64_inc_not_zero. Replace old style
atomic_compare_and_swap by atomic_cmpxchg. Shorten the whole header by
defining most primitives with the two inline functions atomic_add_return and
atomic_sub_return.
In addition this patch contains the s390 related fixes of Hugh's "mm: fill
arch atomic64 gaps" patch.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Run idle threads with preempt disabled.
Also corrected a bugs in arm26's cpu_idle (make it actually call schedule()).
How did it ever work before?
Might fix the CPU hotplugging hang which Nigel Cunningham noted.
We think the bug hits if the idle thread is preempted after checking
need_resched() and before going to sleep, then the CPU offlined.
After calling stop_machine_run, the CPU eventually returns from preemption and
into the idle thread and goes to sleep. The CPU will continue executing
previous idle and have no chance to call play_dead.
By disabling preemption until we are ready to explicitly schedule, this bug is
fixed and the idle threads generally become more robust.
From: alexs <ashepard@u.washington.edu>
PPC build fix
From: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
MIPS build fix
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add code to support the re-IPL method using diagnose 0x308.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Disable pseudo page fault handling before starting the new kernel and try
to use diag308 to reset the machine.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The kernel uses the SIGP external call order code to signal other CPUs. When
running with dedicated CPUs external calls don't get delivered immediately but
within a fixed polling invervall. This can lead to delays where the system
appears to do nothing. Replace the SIGP external call order with the SIGP
emergency call order since this one gets delivered immediately.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add interface to issue VM control program commands.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Improved machine check handling. Kernel is now able to receive machine checks
while in kernel mode (system call, interrupt and program check handling).
Also register validation is now performed.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
(The i386 CPU hotplug patch provides infrastructure for some work which Pavel
is doing as well as for ACPI S3 (suspend-to-RAM) work which Li Shaohua
<shaohua.li@intel.com> is doing)
The following provides i386 architecture support for safely unregistering and
registering processors during runtime, updated for the current -mm tree. In
order to avoid dumping cpu hotplug code into kernel/irq/* i dropped the
cpu_online check in do_IRQ() by modifying fixup_irqs(). The difference being
that on cpu offline, fixup_irqs() is called before we clear the cpu from
cpu_online_map and a long delay in order to ensure that we never have any
queued external interrupts on the APICs. There are additional changes to s390
and ppc64 to account for this change.
1) Add CONFIG_HOTPLUG_CPU
2) disable local APIC timer on dead cpus.
3) Disable preempt around irq balancing to prevent CPUs going down.
4) Print irq stats for all possible cpus.
5) Debugging check for interrupts on offline cpus.
6) Hacky fixup_irqs() to redirect irqs when cpus go off/online.
7) play_dead() for offline cpus to spin inside.
8) Handle offline cpus set in flush_tlb_others().
9) Grab lock earlier in smp_call_function() to prevent CPUs going down.
10) Implement __cpu_disable() and __cpu_die().
11) Enable local interrupts in cpu_enable() after fixup_irqs()
12) Don't fiddle with NMI on dead cpu, but leave intact on other cpus.
13) Program IRQ affinity whilst cpu is still in cpu_online_map on offline.
Signed-off-by: Zwane Mwaikambo <zwane@linuxpower.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!