The cpumask of the padata instance was used without allocated.
This caused boot crashes if CONFIG_CPUMASK_OFFSTACK is enabled.
This patch fixes this by doing proper allocation for this cpumask.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This makes sure that we pick the synchronous signals caused by a
processor fault over any pending regular asynchronous signals sent to
use by [t]kill().
This is not strictly required semantics, but it makes it _much_ easier
for programs like Wine that expect to find the fault information in the
signal stack.
Without this, if a non-synchronous signal gets picked first, the delayed
asynchronous signal will have its signal context pointing to the new
signal invocation, rather than the instruction that caused the SIGSEGV
or SIGBUS in the first place.
This is not all that pretty, and we're discussing making the synchronous
signals more explicit rather than have these kinds of implicit
preferences of SIGSEGV and friends. See for example
http://bugzilla.kernel.org/show_bug.cgi?id=15395
for some of the discussion. But in the meantime this is a simple and
fairly straightforward work-around, and the whole
if (x & Y)
x &= Y;
thing can be compiled into (and gcc does do it) just three instructions:
movq %rdx, %rax
andl $Y, %eax
cmovne %rax, %rdx
so it is at least a simple solution to a subtle issue.
Reported-and-tested-by: Pavel Vilim <wylda@volny.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
x86: Fix out of order of gsi
x86: apic: Fix mismerge, add arch_probe_nr_irqs() again
x86, irq: Keep chip_data in create_irq_nr and destroy_irq
xen: Remove unnecessary arch specific xen irq functions.
smp: Use nr_cpus= to set nr_cpu_ids early
x86, irq: Remove arch_probe_nr_irqs
sparseirq: Use radix_tree instead of ptrs array
sparseirq: Change irq_desc_ptrs to static
init: Move radix_tree_init() early
irq: Remove unnecessary bootmem code
x86: Add iMac9,1 to pci_reboot_dmi_table
x86: Convert i8259_lock to raw_spinlock
x86: Convert nmi_lock to raw_spinlock
x86: Convert ioapic_lock and vector_lock to raw_spinlock
x86: Avoid race condition in pci_enable_msix()
x86: Fix SCI on IOAPIC != 0
x86, ia32_aout: do not kill argument mapping
x86, irq: Move __setup_vector_irq() before the first irq enable in cpu online path
x86, irq: Update the vector domain for legacy irqs handled by io-apic
x86, irq: Don't block IRQ0_VECTOR..IRQ15_VECTOR's on all cpu's
...
* 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
early_res: Need to save the allocation name in drop_range_partial()
sparsemem: Fix compilation on PowerPC
early_res: Add free_early_partial()
x86: Fix non-bootmem compilation on PowerPC
core: Move early_res from arch/x86 to kernel/
x86: Add find_fw_memmap_area
Move round_up/down to kernel.h
x86: Make 32bit support NO_BOOTMEM
early_res: Enhance check_and_double_early_res
x86: Move back find_e820_area to e820.c
x86: Add find_early_area_size
x86: Separate early_res related code from e820.c
x86: Move bios page reserve early to head32/64.c
sparsemem: Put mem map for one node together.
sparsemem: Put usemap for one node together
x86: Make 64 bit use early_res instead of bootmem before slab
x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
x86: Make early_node_mem get mem > 4 GB if possible
x86: Dynamically increase early_res array size
x86: Introduce max_early_res and early_res_count
...
This warning in s_next() can be triggered by lseek():
[<c018b3f7>] ? s_next+0x77/0x80
[<c013e3c1>] warn_slowpath_common+0x81/0xa0
[<c018b3f7>] ? s_next+0x77/0x80
[<c013e3fa>] warn_slowpath_null+0x1a/0x20
[<c018b3f7>] s_next+0x77/0x80
[<c01efa77>] traverse+0x117/0x200
[<c01eff13>] seq_lseek+0xa3/0x120
[<c01efe70>] ? seq_lseek+0x0/0x120
[<c01d7081>] vfs_llseek+0x41/0x50
[<c01d8116>] sys_llseek+0x66/0xa0
[<c0102bd0>] sysenter_do_call+0x12/0x26
The iterator "leftover" variable is zeroed in the opening of the trace
file. But lseek can call s_start() which will call s_next() without
reseting the "leftover" variable back to zero, which might trigger
the WARN_ON_ONCE(iter->leftover) that is in s_next().
Cc: stable@kernel.org
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B8CE06A.9090207@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (38 commits)
SELinux: Make selinux_kernel_create_files_as() shouldn't just always return 0
TOMOYO: Protect find_task_by_vpid() with RCU.
Security: add static to security_ops and default_security_ops variable
selinux: libsepol: remove dead code in check_avtab_hierarchy_callback()
TOMOYO: Remove __func__ from tomoyo_is_correct_path/domain
security: fix a couple of sparse warnings
TOMOYO: Remove unneeded parameter.
TOMOYO: Use shorter names.
TOMOYO: Use enum for index numbers.
TOMOYO: Add garbage collector.
TOMOYO: Add refcounter on domain structure.
TOMOYO: Merge headers.
TOMOYO: Add refcounter on string data.
TOMOYO: Reduce lines by using common path for addition and deletion.
selinux: fix memory leak in sel_make_bools
TOMOYO: Extract bitfield
syslog: clean up needless comment
syslog: use defined constants instead of raw numbers
syslog: distinguish between /proc/kmsg and syscalls
selinux: allow MLS->non-MLS and vice versa upon policy reload
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6:
sparc: Support show_unhandled_signals.
sparc: use __ratelimit
sunxvr500: Additional PCI id for sunxvr500 driver
sparc: use asm-generic/scatterlist.h
sparc64: If 'slot-names' property exist, create sysfs PCI slot information.
sparc: remove trailing space in messages
sparc: remove redundant return statements
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1341 commits)
virtio_net: remove forgotten assignment
be2net: fix tx completion polling
sis190: fix cable detect via link status poll
net: fix protocol sk_buff field
bridge: Fix build error when IGMP_SNOOPING is not enabled
bnx2x: Tx barriers and locks
scm: Only support SCM_RIGHTS on unix domain sockets.
vhost-net: restart tx poll on sk_sndbuf full
vhost: fix get_user_pages_fast error handling
vhost: initialize log eventfd context pointer
vhost: logging thinko fix
wireless: convert to use netdev_for_each_mc_addr
ethtool: do not set some flags, if others failed
ipoib: returned back addrlen check for mc addresses
netlink: Adding inode field to /proc/net/netlink
axnet_cs: add new id
bridge: Make IGMP snooping depend upon BRIDGE.
bridge: Add multicast count/interval sysfs entries
bridge: Add hash elasticity/max sysfs entries
bridge: Add multicast_snooping sysfs toggle
...
Trivial conflicts in Documentation/feature-removal-schedule.txt
The ANY flag can show SMT data of another task (like 'top'),
so we want to disable it when system-wide profiling is
disabled.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because I was wondering why perf stat wasn't working as expected..
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Aaro Koskinen reported an issue in kernel.org bugzilla #15366, where
on non-GENERIC_TIME systems, accessing
/sys/devices/system/clocksource/clocksource0/current_clocksource
results in an oops.
It seems the timekeeper/clocksource rework missed initializing the
curr_clocksource value in the !GENERIC_TIME case.
Thanks to Aaro for reporting and diagnosing the issue as well as
testing the fix!
Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: stable@kernel.org
LKML-Reference: <1267475683.4216.61.camel@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
During free_early_partial(), reserve_early_without_check() could end
extending the early_res area from __check_and_double_early_res(); as a
result, the location of the name for the current reservation could
change.
Therefore, we need to save a local copy of the name.
[ hpa: rewrote comment and checkin description ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B8C7C94.7070000@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The System RAM walk shall skip partial RAM pages and avoid calling
func() on them. So that page_is_ram() return 0 for a partial RAM page.
In particular, it shall not call func() with len=0.
This fixes a boot time bug reported by Sachin and root caused by Thomas:
> >>> WARNING: at arch/x86/mm/ioremap.c:111 __ioremap_caller+0x169/0x2f1()
> >>> Hardware name: BladeCenter LS21 -[79716AA]-
> >>> Modules linked in:
> >>> Pid: 0, comm: swapper Not tainted 2.6.33-git6-autotest #1
> >>> Call Trace:
> >>> [<ffffffff81047cff>] ? __ioremap_caller+0x169/0x2f1
> >>> [<ffffffff81063b7d>] warn_slowpath_common+0x77/0xa4
> >>> [<ffffffff81063bb9>] warn_slowpath_null+0xf/0x11
> >>> [<ffffffff81047cff>] __ioremap_caller+0x169/0x2f1
> >>> [<ffffffff813747a3>] ? acpi_os_map_memory+0x12/0x1b
> >>> [<ffffffff81047f10>] ioremap_nocache+0x12/0x14
> >>> [<ffffffff813747a3>] acpi_os_map_memory+0x12/0x1b
> >>> [<ffffffff81282fa0>] acpi_tb_verify_table+0x29/0x5b
> >>> [<ffffffff812827f0>] acpi_load_tables+0x39/0x15a
> >>> [<ffffffff8191c8f8>] acpi_early_init+0x60/0xf5
> >>> [<ffffffff818f2cad>] start_kernel+0x397/0x3a7
> >>> [<ffffffff818f2295>] x86_64_start_reservations+0xa5/0xa9
> >>> [<ffffffff818f237a>] x86_64_start_kernel+0xe1/0xe8
> >>> ---[ end trace 4eaa2a86a8e2da22 ]---
> >>> ioremap reserve_memtype failed -22
The return code is -EINVAL, so it failed in the is_ram check, which is
not too surprising
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009c000 (usable)
> BIOS-e820: 000000000009c000 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000cffa3900 (usable)
> BIOS-e820: 00000000cffa3900 - 00000000cffa7400 (ACPI data)
The ACPI data is not starting on a page boundary and neither does the
usable RAM area end on a page boundary. Very useful !
> ACPI: DSDT 00000000cffa3900 036CE (v01 IBM SERLEWIS 00001000 INTL 20060912)
ACPI is trying to map DSDT at cffa3900, which results in a check
vs. cffa3000 which is the relevant page boundary. The generic is_ram
check correctly identifies that as RAM because it's in the usable
resource area. The old e820 based is_ram check does not take
overlapping resource areas into account. That's why it works.
CC: Sachin Sant <sachinp@in.ibm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
LKML-Reference: <20100301135551.GA9998@localhost>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits)
block: don't access jiffies when initialising io_context
cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds
block: fix for "Consolidate phys_segment and hw_segment limits"
cfq-iosched: quantum check tweak
blktrace: perform cleanup after setup error
blkdev: fix merge_bvec_fn return value checks
cfq-iosched: requests "in flight" vs "in driver" clarification
cciss: Fix problem with scatter gather elements in the scsi half of the driver
cciss: eliminate unnecessary pointer use in cciss scsi code
cciss: do not use void pointer for scsi hba data
cciss: factor out scatter gather chain block mapping code
cciss: fix scatter gather chain block dma direction kludge
cciss: simplify scatter gather code
cciss: factor out scatter gather chain block allocation and freeing
cciss: detect bad alignment of scsi commands at build time
cciss: clarify command list padding calculation
cfq-iosched: rethink seeky detection for SSDs
cfq-iosched: rework seeky detection
block: remove padding from io_context on 64bit builds
block: Consolidate phys_segment and hw_segment limits
...
* 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futex: Protect pid lookup in compat code with RCU
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Fix documentation of default chip disable()
* 'bkl-drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
nvram: Drop the BKL from nvram_open()
We support event unthrottling in breakpoint events. It means
that if we have more than sysctl_perf_event_sample_rate/HZ,
perf will throttle, ignoring subsequent events until the next
tick.
So if ptrace exceeds this max rate, it will omit events, which
breaks the ptrace determinism that is supposed to report every
triggered breakpoints. This is likely to happen if we set
sysctl_perf_event_sample_rate to 1.
This patch removes support for unthrottling in breakpoint
events to break throttling and restore ptrace determinism.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: 2.6.33.x <stable@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Currently even if BLKTRACESETUP ioctl has failed user must call
BLKTRACETEARDOWN to be shure what all staff was cleaned, which
is contr-intuitive.
Let's setup ioctl make necessery cleanup by it self.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
trace_clock.c includes spinlock.h, which ends up including
asm/system.h, which in turn includes linux/irqflags.h in x86.
So the definition of raw_local_irq_save is luckily covered there,
but this is not the case in parisc:
tip/kernel/trace/trace_clock.c:86: error: implicit declaration of function 'raw_local_irq_save'
tip/kernel/trace/trace_clock.c:112: error: implicit declaration of function 'raw_local_irq_restore'
We need to include linux/irqflags.h directly from trace_clock.c
to avoid such build error.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mm: Unify kernel_physical_mapping_init() API
x86, mm: Allow highmem user page tables to be disabled at boot time
x86: Do not reserve brk for DMI if it's not going to be used
x86: Convert tlbstate_lock to raw_spinlock
x86: Use the generic page_is_ram()
x86: Remove BIOS data range from e820
Move page_is_ram() declaration to mm.h
Generic page_is_ram: use __weak
resources: introduce generic page_is_ram()
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
sched: Fix SCHED_MC regression caused by change in sched cpu_power
sched: Don't use possibly stale sched_class
kthread, sched: Remove reference to kthread_create_on_cpu
sched: cpuacct: Use bigger percpu counter batch values for stats counters
percpu_counter: Make __percpu_counter_add an inline function on UP
sched: Remove member rt_se from struct rt_rq
sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu]
sched: Remove unused update_shares_locked()
sched: Use for_each_bit
sched: Queue a deboosted task to the head of the RT prio queue
sched: Implement head queueing for sched_rt
sched: Extend enqueue_task to allow head queueing
sched: Remove USER_SCHED
sched: Fix the place where group powers are updated
sched: Assume *balance is valid
sched: Remove load_balance_newidle()
sched: Unify load_balance{,_newidle}()
sched: Add a lock break for PREEMPT=y
sched: Remove from fwd decls
sched: Remove rq_iterator from move_one_task
...
Fix up trivial conflicts in kernel/sched.c
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix race between ttwu() and task_rq_lock()
sched: Fix SMT scheduler regression in find_busiest_queue()
sched: Fix sched_mv_power_savings for !SMT
kernel/sched.c: Suppress unused var warning
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
ftrace: Add function names to dangling } in function graph tracer
tracing: Simplify memory recycle of trace_define_field
tracing: Remove unnecessary variable in print_graph_return
tracing: Fix typo of info text in trace_kprobe.c
tracing: Fix typo in prof_sysexit_enable()
tracing: Remove CONFIG_TRACE_POWER from kernel config
tracing: Fix ftrace_event_call alignment for use with gcc 4.5
ftrace: Remove memory barriers from NMI code when not needed
tracing/kprobes: Add short documentation for HAVE_REGS_AND_STACK_ACCESS_API
s390: Add pt_regs register and stack access API
tracing/kprobes: Make Kconfig dependencies generic
tracing: Unify arch_syscall_addr() implementations
tracing: Add notrace to TRACE_EVENT implementation functions
ftrace: Allow to remove a single function from function graph filter
tracing: Add correct/incorrect to sort keys for branch annotation output
tracing: Simplify test for function_graph tracing start point
tracing: Drop the tr check from the graph tracing path
tracing: Add stack dump to trace_printk if stacktrace option is set
tracing: Use appropriate perl constructs in recordmcount.pl
tracing: optimize recordmcount.pl for offsets-handling
...
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits)
rcu: Fix accelerated GPs for last non-dynticked CPU
rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot
rcu: Fix accelerated grace periods for last non-dynticked CPU
rcu: Export rcu_scheduler_active
rcu: Make rcu_read_lock_sched_held() take boot time into account
rcu: Make lockdep_rcu_dereference() message less alarmist
sched, cgroups: Fix module export
rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information
rcu: Fix rcutorture mod_timer argument to delay one jiffy
rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection
rcu: Convert to raw_spinlocks
rcu: Stop overflowing signed integers
rcu: Use canonical URL for Mathieu's dissertation
rcu: Accelerate grace period if last non-dynticked CPU
rcu: Fix citation of Mathieu's dissertation
rcu: Documentation update for CONFIG_PROVE_RCU
security: Apply lockdep-based checking to rcu_dereference() uses
idr: Apply lockdep-based diagnostics to rcu_dereference() uses
radix-tree: Disable RCU lockdep checking in radix tree
vfs: Abstract rcu_dereference_check for files-fdtable use
...
* 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
plist: Fix grammar mistake, and c-style mistake
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
kprobes: Add mcount to the kprobes blacklist
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86_64: Print modules like i386 does
* 'x86-doc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Put 'nopat' in kernel-parameters
* 'x86-gpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64: Allow fbdev primary video code
* 'x86-rlimit-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Use helpers for rlimits
Add __percpu sparse annotations to hw_breakpoint.
These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.
In kernel/hw_breakpoint.c, per_cpu(nr_task_bp_pinned, cpu)'s will
trigger spurious noderef related warnings from sparse. Changing it to
&per_cpu(nr_task_bp_pinned[0], cpu) will work around the problem but
deemed to ugly by the maintainer. Leave it alone until better
solution can be found.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <4B7B4B7A.9050902@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
The function graph tracer is currently the most invasive tracer
in the ftrace family. It can easily overflow the buffer even with
10megs per CPU. This means that events can often be lost.
On start up, or after events are lost, if the function return is
recorded but the function enter was lost, all we get to see is the
exiting '}'.
Here is how a typical trace output starts:
[tracing] cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) + 91.897 us | }
0) ! 567.961 us | }
0) <========== |
0) ! 579.083 us | _raw_spin_lock_irqsave();
0) 4.694 us | _raw_spin_unlock_irqrestore();
0) ! 594.862 us | }
0) ! 603.361 us | }
0) ! 613.574 us | }
0) ! 623.554 us | }
0) 3.653 us | fget_light();
0) | sock_poll() {
There are a series of '}' with no matching "func() {". There's no information
to what functions these ending brackets belong to.
This patch adds a stack on the per cpu structure used in outputting
the function graph tracer to keep track of what function was outputted.
Then on a function exit event, it checks the depth to see if the
function exit has a matching entry event. If it does, then it only
prints the '}', otherwise it adds the function name after the '}'.
This allows function exit events to show what function they belong to
at trace output startup, when the entry was lost due to ring buffer
overflow, or even after a new task is scheduled in.
Here is what the above trace will look like after this patch:
[tracing] cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) + 91.897 us | } (irq_exit)
0) ! 567.961 us | } (smp_apic_timer_interrupt)
0) <========== |
0) ! 579.083 us | _raw_spin_lock_irqsave();
0) 4.694 us | _raw_spin_unlock_irqrestore();
0) ! 594.862 us | } (add_wait_queue)
0) ! 603.361 us | } (__pollwait)
0) ! 613.574 us | } (tcp_poll)
0) ! 623.554 us | } (sock_poll)
0) 3.653 us | fget_light();
0) | sock_poll() {
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The hibernate memory preallocation code allocates memory to push some
user space data out of physical RAM, so that the hibernation image is
not too large. It allocates more memory than necessary for creating
the image, so it has to release some pages to make room for
allocations made while suspending devices and disabling nonboot CPUs,
or the system will hang due to the lack of free pages to allocate
from. Unfortunately, the function used for freeing these pages,
free_unnecessary_pages(), contains a bug that prevents it from doing
the job on all systems without highmem.
Fix this problem, which is a regression from the 2.6.30 kernel, by
using the right condition for the termination of the loop in
free_unnecessary_pages().
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-and-tested-by: Alan Jenkins <sourcejedi.lkml@googlemail.com>
Cc: stable@kernel.org
Its contents and entry in Makefile were already removed in
8e60c6a134
(Shift remaining code from swsusp.c to hibernate.c)
but somehow it remained in-place (rjw: which most likely was my
mistake).
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Remove a trailing space from a message in swsusp_save().
Signed-off-by: Frans Pop <elendil@planet.nl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
It will never reach here if the sws_resume_bdev is erratic.
swsusp_read() is called only from software_resume(), but after
swsusp_check() which would catch the error state.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
They were deprecated and removed from exported headers more than 2
years ago. Inform users about their removal in the future now.
(Switch cases needed to be reorderded for an easy fall through.)
And add an entry to feature-removal-schedule.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Add configuration switch CONFIG_PM_ADVANCED_DEBUG for compiling in
extra PM debugging/testing code allowing one to access some
PM-related attributes of devices from the user space via sysfs.
If CONFIG_PM_ADVANCED_DEBUG is set, add sysfs attribute power/async
for every device allowing the user space to access the device's
power.async_suspend flag and modify it, if desired.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Add sysfs attribute /sys/power/pm_async allowing the user space to
disable/enable asynchronous suspend/resume of devices.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (48 commits)
x86/PCI: Prevent mmconfig memory corruption
ACPI: Use GPE reference counting to support shared GPEs
x86/PCI: use host bridge _CRS info by default on 2008 and newer machines
PCI: augment bus resource table with a list
PCI: add pci_bus_for_each_resource(), remove direct bus->resource[] refs
PCI: read bridge windows before filling in subtractive decode resources
PCI: split up pci_read_bridge_bases()
PCIe PME: use pci_pcie_cap()
PCI PM: Run-time callbacks for PCI bus type
PCIe PME: use pci_is_pcie()
PCI / ACPI / PM: Platform support for PCI PME wake-up
ACPI / ACPICA: Multiple system notify handlers per device
ACPI / PM: Add more run-time wake-up fields
ACPI: Use GPE reference counting to support shared GPEs
PCI PM: Make it possible to force using INTx for PCIe PME signaling
PCI PM: PCIe PME root port service driver
PCI PM: Add function for checking PME status of devices
PCI: mark is_pcie obsolete
PCI: set PCI_PREF_RANGE_TYPE_64 in pci_bridge_check_ranges
PCI: pciehp: second try to get big range for pcie devices
...
A recent commit introduced a preemption warning for
perf_clock(), use raw_smp_processor_id() to avoid this, it
really doesn't matter which cpu we use here.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1267198583.22519.684.camel@laptop>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On platforms like dual socket quad-core platform, the scheduler load
balancer is not detecting the load imbalances in certain scenarios. This
is leading to scenarios like where one socket is completely busy (with
all the 4 cores running with 4 tasks) and leaving another socket
completely idle. This causes performance issues as those 4 tasks share
the memory controller, last-level cache bandwidth etc. Also we won't be
taking advantage of turbo-mode as much as we would like, etc.
Some of the comparisons in the scheduler load balancing code are
comparing the "weighted cpu load that is scaled wrt sched_group's
cpu_power" with the "weighted average load per task that is not scaled
wrt sched_group's cpu_power". While this has probably been broken for a
longer time (for multi socket numa nodes etc), the problem got aggrevated
via this recent change:
|
| commit f93e65c186
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date: Tue Sep 1 10:34:32 2009 +0200
|
| sched: Restore __cpu_power to a straight sum of power
|
Also with this change, the sched group cpu power alone no longer reflects
the group capacity that is needed to implement MC, MT performance
(default) and power-savings (user-selectable) policies.
We need to use the computed group capacity (sgs.group_capacity, that is
computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
find out if the group with the max load is above its capacity and how
much load to move etc.
Reported-by: Ma Ling <ling.ma@intel.com>
Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
[ -v2: build fix ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org> # [2.6.32.x, 2.6.33.x]
LKML-Reference: <1266970432.11588.22.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the cpu argument to hw_perf_group_sched_in() is always
smp_processor_id(), simplify the code a little by removing this argument
and using the current cpu where needed.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Miller <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1265890918.5396.3.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds correct AMD NorthBridge event scheduling.
NB events are events measuring L3 cache, Hypertransport traffic. They are
identified by an event code >= 0xe0. They measure events on the
Northbride which is shared by all cores on a package. NB events are
counted on a shared set of counters. When a NB event is programmed in a
counter, the data actually comes from a shared counter. Thus, access to
those counters needs to be synchronized.
We implement the synchronization such that no two cores can be measuring
NB events using the same counters. Thus, we maintain a per-NB allocation
table. The available slot is propagated using the event_constraint
structure.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4b703957.0702d00a.6bf2.7b7d@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In certain situations, the kernel may need to stop and start the same
event rapidly. The current PMU callbacks do not distinguish between stop
and release (i.e., stop + free the resource). Thus, a counter may be
released, then it will be immediately re-acquired. Event scheduling will
again take place with no guarantee to assign the same counter. On some
processors, this may event yield to failure to assign the event back due
to competion between cores.
This patch is adding a new pair of callback to stop and restart a counter
without actually release the underlying counter resource. On stop, the
counter is stopped, its values saved and that's it. On start, the value
is reloaded and counter is restarted (on x86, actual restart is delayed
until perf_enable()).
Signed-off-by: Stephane Eranian <eranian@google.com>
[ added fallback to ->enable/->disable for all other PMUs
fixed x86_pmu_start() to call x86_pmu.enable()
merged __x86_pmu_disable into x86_pmu_stop() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4b703875.0a04d00a.7896.ffffb824@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
DaveM reported that currently perf interprets the pgoff value reported by
the MMAP events as a byte range, but the kernel reports it as a page
offset.
Since its broken (and unusable) anyway, change the kernel behaviour (ABI)
to report bytes indeed, avoiding the need for userspace to deal with
PAGE_SIZE things.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce kprobes jump optimization arch-independent parts.
Kprobes uses breakpoint instruction for interrupting execution
flow, on some architectures, it can be replaced by a jump
instruction and interruption emulation code. This gains kprobs'
performance drastically.
To enable this feature, set CONFIG_OPTPROBES=y (default y if the
arch supports OPTPROBE).
Changes in v9:
- Fix a bug to optimize probe when enabling.
- Check nearby probes can be optimize/unoptimize when disarming/arming
kprobes, instead of registering/unregistering. This will help
kprobe-tracer because most of probes on it are usually disabled.
Changes in v6:
- Cleanup coding style for readability.
- Add comments around get/put_online_cpus().
Changes in v5:
- Use get_online_cpus()/put_online_cpus() for avoiding text_mutex
deadlock.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
LKML-Reference: <20100225133407.6725.81992.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Discard freeing field->type since it is not necessary.
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
LKML-Reference: <1266997226-6833-5-git-send-email-wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "cpu" variable is declared at the start of the function and
also within a branch, with the exact same initialization.
Remove the local variable of the same name in the branch.
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
LKML-Reference: <1266997226-6833-3-git-send-email-wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The power tracer has been converted to power trace events.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B84D50E.4070806@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
GCC 4.5 introduces behavior that forces the alignment of structures to
use the largest possible value. The default value is 32 bytes, so if
some structures are defined with a 4-byte alignment and others aren't
declared with an alignment constraint at all - it will align at 32-bytes.
For things like the ftrace events, this results in a non-standard array.
When initializing the ftrace subsystem, we traverse the _ftrace_events
section and call the initialization callback for each event. When the
structures are misaligned, we could be treating another part of the
structure (or the zeroed out space between them) as a function pointer.
This patch forces the alignment for all the ftrace_event_call structures
to 4 bytes.
Without this patch, the kernel fails to boot very early when built with
gcc 4.5.
It's trivial to check the alignment of the members of the array, so it
might be worthwhile to add something to the build system to do that
automatically. Unfortunately, that only covers this case. I've asked one
of the gcc developers about adding a warning when this condition is seen.
Cc: stable@kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
LKML-Reference: <4B85770B.6010901@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Under TREE_PREEMPT_RCU, print_other_cpu_stall() invokes
rcu_print_task_stall() with the root rcu_node structure's ->lock
held, and rcu_print_task_stall() acquires that same lock for
self-deadlock. Fix this by removing the lock acquisition from
rcu_print_task_stall(), and making all callers acquire the lock
instead.
Tested-by: John Kacur <jkacur@redhat.com>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Located-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-19-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The C standard does not specify the result of an operation that
overflows a signed integer, so such operations need to be
avoided. This patch changes the type of several fields from
"long" to "unsigned long" and adjusts operations as needed.
ULONG_CMP_GE() and ULONG_CMP_LT() macros are introduced to do
the modular comparisons that are appropriate given that overflow
is an expected event.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-17-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, rcu_needs_cpu() simply checks whether the current CPU
has an outstanding RCU callback, which means that the last CPU
to go into dyntick-idle mode might wait a few ticks for the
relevant grace periods to complete. However, if all the other
CPUs are in dyntick-idle mode, and if this CPU is in a quiescent
state (which it is for RCU-bh and RCU-sched any time that we are
considering going into dyntick-idle mode), then the grace period
is instantly complete.
This patch therefore repeatedly invokes the RCU grace-period
machinery in order to force any needed grace periods to complete
quickly. It does so a limited number of times in order to
prevent starvation by an RCU callback function that might pass
itself to call_rcu().
However, if any CPU other than the current one is not in
dyntick-idle mode, fall back to simply checking (with fix to bug
noted by Lai Jiangshan). Also, take advantage of last
grace-period forcing, the opportunity to do so noted by Steve
Rostedt. And apply simplified #ifdef condition suggested by
Frederic Weisbecker.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-15-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Inspection is proving insufficient to catch all RCU misuses,
which is understandable given that rcu_dereference() might be
protected by any of four different flavors of RCU (RCU, RCU-bh,
RCU-sched, and SRCU), and might also/instead be protected by any
of a number of locking primitives. It is therefore time to
enlist the aid of lockdep.
This set of patches is inspired by earlier work by Peter
Zijlstra and Thomas Gleixner, and takes the following approach:
o Set up separate lockdep classes for RCU, RCU-bh, and RCU-sched.
o Set up separate lockdep classes for each instance of SRCU.
o Create primitives that check for being in an RCU read-side
critical section. These return exact answers if lockdep is
fully enabled, but if unsure, report being in an RCU read-side
critical section. (We want to avoid false positives!)
The primitives are:
For RCU: rcu_read_lock_held(void)
For RCU-bh: rcu_read_lock_bh_held(void)
For RCU-sched: rcu_read_lock_sched_held(void)
For SRCU: srcu_read_lock_held(struct srcu_struct *sp)
o Add rcu_dereference_check(), which takes a second argument
in which one places a boolean expression based on the above
primitives and/or lockdep_is_held().
o A new kernel configuration parameter, CONFIG_PROVE_RCU, enables
rcu_dereference_check(). This depends on CONFIG_PROVE_LOCKING,
and should be quite helpful during the transition period while
CONFIG_PROVE_RCU-unaware patches are in flight.
The existing rcu_dereference() primitive does no checking, but
upcoming patches will change that.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Return -EINVAL for the bad size and for unrecognized NT_* type in
ptrace_regset() instead of -EIO.
Also update the comments for this ptrace interface with more clarifications.
Requested-by: Roland McGrath <roland@redhat.com>
Requested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100222225240.397523600@sbs-t61.sc.intel.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
find_task_by_vpid() is not safe without rcu_read_lock(). 2.6.33-rc7 got
RCU protection for sys_setpriority() but missed it for sys_getpriority().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce run-time PM callbacks for the PCI bus type. Make the new
callbacks work in analogy with the existing system sleep PM
callbacks, so that the drivers already converted to struct dev_pm_ops
can use their suspend and resume routines for run-time PM without
modifications.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Useful for freeing a portion of the resource tree, e.g. when trying to
reallocate resources more efficiently.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Now that we return the new resource start position, there is no
need to update "struct resource" inside the align function.
Therefore, mark the struct resource as const.
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
As suggested by Linus, align functions should return the start
of a resource, not void. An update of "res->start" is no longer
necessary.
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Make remaining netlink policies as const.
Fixup coding style where needed.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use radix_tree irq_desc_tree instead of irq_desc_ptrs.
-v2: according to Eric and cyrill to use radix_tree_lookup_slot and
radix_tree_replace_slot
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-32-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add replace_irq_desc() instead of poking at the array directly.
-v2: remove unneeded boundary check in replace_irq_desc
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-31-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
mem_init is moved early already.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-29-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
x86/mm is on 32-rc4 and missing the spinlock namespace changes which
are needed for further commits into this topic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
KPROBES_EVENT actually depends on the regs and stack access API
(b1cf540f) and not on x86.
So introduce a new config option which architectures can select if
they have the API implemented and switch x86.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <20100210162517.GB6933@osiris.boeblingen.de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Most implementations of arch_syscall_addr() are the same, so create a
default version in common code and move the one piece that differs (the
syscall table) to asm/syscall.h. New arch ports don't have to waste
time copying & pasting this simple function.
The s390/sparc versions need to be different, so document why.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1264498803-17278-1-git-send-email-vapier@gentoo.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
setscheduler() saves task->sched_class outside of the rq->lock held
region for a check after the setscheduler changes have become
effective. That might result in checking a stale value.
rtmutex_setprio() has the same problem, though it is protected by
p->pi_lock against setscheduler(), but for correctness sake (and to
avoid bad examples) it needs to be fixed as well.
Retrieve task->sched_class inside of the rq->lock held region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
Paul Mackerras brought up a good point that when fallbacking to
software events, I may have been lucky in my configuration.
Modified the code to explicit provide a new configuration for
software events.
Suggested-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1266357745-26671-1-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Re-arrange the code so that when someone disables nmi_watchdog
with:
echo 0 > /proc/sys/kernel/nmi_watchdog
it releases the hardware reservation on the PMUs. This allows
the oprofile module to grab those PMUs and do its thing.
Otherwise oprofile fails to load because the hardware is
reserved by the perf_events subsystem.
Tested using:
oprofile --vm-linux --start
and watched it failed when nmi_watchdog is enabled and succeed
when:
oprofile --deinit && echo 0 > /proc/sys/kernel/nmi_watchdog
is run.
Note: this has the side quirk of having the nmi_watchdog latch
onto the software events instead of hardware events if oprofile
has already reserved the hardware first. User beware! :-)
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
Cc: eranian@google.com
LKML-Reference: <1266357892-30504-1-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This makes the range reservation feature available to other
architectures.
-v2: add get_max_mapped, max_pfn_mapped only defined in x86...
to fix PPC compiling
-v3: according to hpa, add CONFIG_HAVE_EARLY_RES
-v4: fix typo about EARLY_RES in config
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B7B5723.4070009@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add __percpu sparse annotations to core subsystems.
These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-mm@kvack.org
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Biederman <ebiederm@xmission.com>
This patch fixes following sparse warnings:
include/linux/kfifo.h:127:25: warning: Using plain integer as NULL pointer
kernel/kfifo.c:83:21: warning: Using plain integer as NULL pointer
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
After kfifo rework it's no longer possible to reliably know if kfifo is
usable, since after kfifo_free(), kfifo_initialized() would still return
true. The correct behaviour is needed for at least FHCI USB driver.
This patch fixes the issue by resetting the kfifo to zero values (the
same approach is used in kfifo_alloc() if allocation failed).
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Conflicts: kernel/sched.c
Necessary due to the urgent fixes which conflict with the code move
from sched.c to sched_fair.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas found that due to ttwu() changing a task's cpu without holding
the rq->lock, task_rq_lock() might end up locking the wrong rq.
Avoid this by serializing against TASK_WAKING.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1266241712.15770.420.camel@laptop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix a SMT scheduler performance regression that is leading to a scenario
where SMT threads in one core are completely idle while both the SMT threads
in another core (on the same socket) are busy.
This is caused by this commit (with the problematic code highlighted)
commit bdb94aa5db
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Sep 1 10:34:38 2009 +0200
sched: Try to deal with low capacity
@@ -4203,15 +4223,18 @@ find_busiest_queue()
...
for_each_cpu(i, sched_group_cpus(group)) {
+ unsigned long power = power_of(i);
...
- wl = weighted_cpuload(i);
+ wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
+ wl /= power;
- if (rq->nr_running == 1 && wl > imbalance)
+ if (capacity && rq->nr_running == 1 && wl > imbalance)
continue;
On a SMT system, power of the HT logical cpu will be 589 and
the scheduler load imbalance (for scenarios like the one mentioned above)
can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
the weighted load with the power will result in "wl > imbalance" and
ultimately resulting in find_busiest_queue() return NULL, causing
load_balance() to think that the load is well balanced. But infact
one of the tasks can be moved to the idle core for optimal performance.
We don't need to use the weighted load (wl) scaled by the cpu power to
compare with imabalance. In that condition, we already know there is only a
single task "rq->nr_running == 1" and the comparison between imbalance,
wl is to make sure that we select the correct priority thread which matches
imbalance. So we really need to compare the imabalnce with the original
weighted load of the cpu and not the scaled load.
But in other conditions where we want the most hammered(busiest) cpu, we can
use scaled load to ensure that we consider the cpu power in addition to the
actual load on that cpu, so that we can move the load away from the
guy that is getting most hammered with respect to the actual capacity,
as compared with the rest of the cpu's in that busiest group.
Fix it.
Reported-by: Ma Ling <ling.ma@intel.com>
Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
Cc: stable@kernel.org [2.6.32.x]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf top: Fix help text alignment
perf: Fix hypervisor sample reporting
perf: Make bp_len type to u64 generic across the arch
Since mcount function can be called from everywhere,
it should be blacklisted. Moreover, the "mcount" symbol
is a special symbol name. So, it is better to put it in
the generic blacklist.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100205062433.3745.36726.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 22e19085 ("Honour event state for aux stream data")
introduced a bug where we would drop FORK events.
The thing is that we deliver FORK events to the child process'
event, which at that time will be PERF_EVENT_STATE_INACTIVE
because the child won't be scheduled in (we're in the middle of
fork).
Solve this twice, change the event state filter to exclude only
disabled (STATE_OFF) or worse, and deliver FORK events to the
current (parent).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
LKML-Reference: <1266142324.5273.411.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Trying to add a probe like:
echo p:myprobe 0x10000 > /sys/kernel/debug/tracing/kprobe_events
will fail since the wrong pointer is passed to strict_strtoul
when trying to convert the address to an unsigned long.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100210162346.GA6933@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Not all arches have a PMU or have perf_event support for their
PMU. The nmi_watchdog will fail in those cases. Fallback to
using software events to generate nmi_watchdog traffic with
local apic interrupts.
Tested on a Pentium4 and it worked as expected, excepting for
detecting cpu lockups.
The problem with using software events as a cpu lock up detector
is the nmi_watchdog uses the logic that if local apic interrupts
stop incrementing then the cpu is probably locked up. But with
software events we use the local apic to trigger the
nmi_watchdog callback to see if local apic interrupts are still
firing, which obviously they are otherwise we wouldn't have been
triggered.
The algorithm to detect cpu lock ups is the same as the old
nmi_watchdog. Perhaps we need to find a better way to detect
lock ups?
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: peterz@infradead.org
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
LKML-Reference: <1266013161-31197-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The original patch was x86_64 centric. Changed the code to make
it less so.
ested by building and running on a powerpc.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: peterz@infradead.org
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
LKML-Reference: <1266013161-31197-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Generic support for PTRACE_GETREGSET/PTRACE_SETREGSET commands which
export the regsets supported by each architecture using the correponding
NT_* types. These NT_* types are already part of the userland ABI, used
in representing the architecture specific register sets as different NOTES
in an ELF core file.
'addr' parameter for the ptrace system call encode the REGSET type (using
the corresppnding NT_* type) and the 'data' parameter points to the
struct iovec having the user buffer and the length of that buffer.
struct iovec iov = { buf, len};
ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, &iov);
On successful completion, iov.len will be updated by the kernel specifying
how much the kernel has written/read to/from the user's iov.buf.
x86 extended state registers are primarily exported using this interface.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100211195614.886724710@sbs-t61.sc.intel.com>
Acked-by: Hongjiu Lu <hjl.tools@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
I don't see why we can only clear all functions from the filter.
After patching:
# echo sys_open > set_graph_function
# echo sys_close >> set_graph_function
# cat set_graph_function
sys_open
sys_close
# echo '!sys_close' >> set_graph_function
# cat set_graph_function
sys_open
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B726388.2000408@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
So make interface more consistent with early_res.
Later we can share some code with early_res.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-10-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
We have almost the same code for mtrr cleanup and amd_bus checkup, and
this code will also be used in replacing bootmem with early_res,
so try to move them together and reuse it from different parts.
Also rename update_range to subtract_range as that is what the
function is actually doing.
-v2: update comments as Christoph requested
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-4-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Keep chip_data in create_irq_nr and destroy_irq.
When two drivers are setting up MSI-X at the same time via
pci_enable_msix() there is a race. See this dmesg excerpt:
[ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[ 85.170611] alloc irq_desc for 99 on node -1
[ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[ 85.170614] alloc kstat_irqs on node -1
[ 85.170616] alloc irq_2_iommu on node -1
[ 85.170617] alloc irq_desc for 100 on node -1
[ 85.170619] alloc kstat_irqs on node -1
[ 85.170621] alloc irq_2_iommu on node -1
[ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[ 85.170626] alloc irq_desc for 101 on node -1
[ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[ 85.170630] alloc kstat_irqs on node -1
[ 85.170631] alloc irq_2_iommu on node -1
[ 85.170635] alloc irq_desc for 102 on node -1
[ 85.170636] alloc kstat_irqs on node -1
[ 85.170639] alloc irq_2_iommu on node -1
[ 85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088
As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function.
ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().
igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:
cfg_new = irq_desc_ptrs[102]->chip_data;
if (cfg_new->vector != 0)
continue;
This hits the NULL deref.
Another possible race exists via pci_disable_msix() in a driver or in
the number of error paths that call free_msi_irqs():
destroy_irq()
dynamic_irq_cleanup() which sets desc->chip_data = NULL
...race window...
desc->chip_data = cfg;
Remove the save and restore code for cfg in create_irq_nr() and
destroy_irq() and take the desc->lock when checking the irq_cfg.
Reported-and-analyzed-by: Brandon Philips <bphilips@suse.de>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-3-git-send-email-yinghai@kernel.org>
Signed-off-by: Brandon Phililps <bphilips@suse.de>
Cc: stable@kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Export getboottime and monotonic_to_bootbased in order to let them
could be used by following patch.
Cc: stable@kernel.org
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kthread_create_on_cpu doesn't exist so update a comment in
kthread.c to reflect this.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100209040740.GB3702@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kthread_create_on_cpu doesn't exist so update a comment in kthread.c to reflect
this.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are
enabled we can call cpuacct_update_stats with values much larger
than percpu_counter_batch. This means the call to
percpu_counter_add will always add to the global count which is
protected by a spinlock and we end up with a global spinlock in
the scheduler.
Based on an idea by KOSAKI Motohiro, this patch scales the batch
value by cputime_one_jiffy such that we have the same batch
limit as we would if CONFIG_VIRT_CPU_ACCOUNTING was disabled.
His patch did this once at boot but that initialisation happened
too early on PowerPC (before time_init) and it was never updated
at runtime as a result of a hotplug cpu add/remove.
This patch instead scales percpu_counter_batch by
cputime_one_jiffy at runtime, which keeps the batch correct even
after cpu hotplug operations. We cap it at INT_MAX in case of
overflow.
For architectures that do not support
CONFIG_VIRT_CPU_ACCOUNTING, cputime_one_jiffy is the constant 1
and gcc is smart enough to optimise min(s32
percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
least on x86 and PowerPC. So there is no need to add an #ifdef.
On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark
is 234x faster and almost matches a CONFIG_CGROUP_CPUACCT
disabled kernel:
CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
Tested with:
wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On UP:
kernel/sched.c: In function 'wake_up_new_task':
kernel/sched.c:2631: warning: unused variable 'cpu'
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These are the bits that enable the new nmi_watchdog and safely
isolate the old nmi_watchdog. Only one or the other can run,
not both at the same time.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
Cc: peterz@infradead.org
LKML-Reference: <1265424425-31562-4-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a new generic nmi_watchdog implementation using the perf
events infrastructure as suggested by Ingo.
The implementation is simple, just create an in-kernel perf
event and register an overflow handler to check for cpu lockups.
I created a generic implementation that lives in kernel/ and
the hardware specific part that for now lives in arch/x86.
This approach has a number of advantages:
- It simplifies the x86 PMU implementation in the long run,
in that it removes the hardcoded low-level PMU implementation
that was the NMI watchdog before.
- It allows new NMI watchdog features to be added in a central
place.
- It allows other architectures to enable the NMI watchdog,
as long as they have perf events (that provide NMIs)
implemented.
- It also allows for more graceful co-existence of existing
perf events apps and the NMI watchdog - before these changes
the relationship was exclusive. (The NMI watchdog will 'spend'
a perf event when enabled. In later iterations we might be
able to piggyback from an existing NMI event without having
to allocate a hardware event for the NMI watchdog - turning
this into a no-hardware-cost feature.)
As for compatibility, we'll keep the old NMI watchdog code as
well until the new one can 100% replace it on all CPUs, old and
new alike. That might take some time as the NMI watchdog has
been ported to many CPU models.
I have done light testing to make sure the framework works
correctly and it does.
v2: Set the correct timeout values based on the old nmi
watchdog
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
Cc: peterz@infradead.org
LKML-Reference: <1265424425-31562-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a clocksource suspend callback. This callback can be used by the
clocksource driver to shutdown and perform any kind of late suspend
activities even though the clocksource driver itself is a non-sysdev
driver.
One example where this is useful is to fix the sh_cmt.c platform driver
that today suspends using the platform bus and shuts down the clocksource
too early.
With this callback in place the sh_cmt driver will suspend using the
clocksource and clockevent hooks and leave the platform device pm
callbacks unused.
Signed-off-by: Magnus Damm <damm@opensource.se>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pass the clocksource as an argument to the clocksource resume callback.
Needed so we can point out which CMT channel the sh_cmt.c driver shall
resume.
Signed-off-by: Magnus Damm <damm@opensource.se>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some misspelled occurences of 'octet' and some comments were also fixed
as I was on it.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Some comments misspell "should" or "shouldn't"; this fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Since mcount function can be called from everywhere,
it should be blacklisted. Moreover, the "mcount" symbol
is a special symbol name. So, it is better to put it in
the generic blacklist.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100205062433.3745.36726.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futex: Handle futex value corruption gracefully
futex: Handle user space corruption gracefully
futex_lock_pi() key refcnt fix
softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume
Some comments misspell "invocation"; this fixes them. No code
changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Some comments misspell "truly"; this fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pretty much all of the calls do perf_disable/perf_enable cycles, pull
that out to cut back on hardware programming.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's a duplicate of tg->rt_se[cpu] and the only usage is
sched_rt_rq_dequeue() and sched_rt_rq_enqueue(). After the
first patch to those two function. rt_se can be removed.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <2674af741001282258q38781619u653ca4a7dd267347@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is the first step to remove rt_rq member rt_se because it have the
same meaning with tg->rt_se[cpu]. And the latter style is also used by
the fair scheduling class.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <2674af741001282257r28c97a92o9f90cf16fe8d3d84@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove record freezing. Because kprobes never puts probe on
ftrace's mcount call anymore, it doesn't need ftrace to check
whether kprobes on it.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: przemyslaw@pawelczyk.it
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100202214925.4694.73469.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Check whether the address of new probe is already reserved by
ftrace or alternatives (on x86) when registering new probe.
If reserved, it returns an error and not register the probe.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: przemyslaw@pawelczyk.it
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <20100202214918.4694.94179.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introducing *_text_reserved functions for checking the text
address range is partially reserved or not. This patch provides
checking routines for x86 smp alternatives and dynamic ftrace.
Since both functions modify fixed pieces of kernel text, they
should reserve and protect those from other dynamic text
modifier, like kprobes.
This will also be extended when introducing other subsystems
which modify fixed pieces of kernel text. Dynamic text modifiers
should avoid those.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: przemyslaw@pawelczyk.it
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <20100202214911.4694.16587.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Disable kprobe booster when CONFIG_PREEMPT=y at this time,
because it can't ensure that all kernel threads preempted on
kprobe's boosted slot run out from the slot even using
freeze_processes().
The booster on preemptive kernel will be resumed if
synchronize_tasks() or something like that is introduced.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100202214904.4694.24330.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Right now the syslog "type" action are just raw numbers which makes
the source difficult to follow. This patch replaces the raw numbers
with defined constants for some level of sanity.
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
This allows the LSM to distinguish between syslog functions originating
from /proc/kmsg access and direct syscalls. By default, the commoncaps
will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg
file descriptor. For example the kernel syslog reader can now drop
privileges after opening /proc/kmsg, instead of staying privileged with
CAP_SYS_ADMIN. MAC systems that implement security_syslog have unchanged
behavior.
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
Change 'bp_len' type to __u64 to make it work across archs as
the s390 architecture watch point length can be upto 2^64.
reference:
http://lkml.org/lkml/2010/1/25/212
This is an ABI change that is not backward compatible with
the previous hardware breakpoint info layout integrated in this
development cycle, a rebuilt of perf tools is necessary for
versions based on 2.6.33-rc1 - 2.6.33-rc6 to work with a
kernel based on this patch.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin <schwidefsky@de.ibm.com>
LKML-Reference: <20100130045518.GA20776@in.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
hrtimers callbacks are always done from hardirq context, either the
jiffy tick interrupt or the hrtimer device interrupt.
[ there is currently one exception that can still call a hrtimer
callback from softirq, but even in that case this will still
work correctly. ]
Reported-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yury Polyanskiy <ypolyans@princeton.edu>
Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
LKML-Reference: <1265120401.24455.306.camel@laptop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The WARN_ON in lookup_pi_state which complains about a mismatch
between pi_state->owner->pid and the pid which we retrieved from the
user space futex is completely bogus.
The code just emits the warning and then continues despite the fact
that it detected an inconsistent state of the futex. A conveniant way
for user space to spam the syslog.
Replace the WARN_ON by a consistency check. If the values do not match
return -EINVAL and let user space deal with the mess it created.
This also fixes the missing task_pid_vnr() when we compare the
pi_state->owner pid with the futex value.
Reported-by: Jermome Marchand <jmarchan@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
If the owner of a PI futex dies we fix up the pi_state and set
pi_state->owner to NULL. When a malicious or just sloppy programmed
user space application sets the futex value to 0 e.g. by calling
pthread_mutex_init(), then the futex can be acquired again. A new
waiter manages to enqueue itself on the pi_state w/o damage, but on
unlock the kernel dereferences pi_state->owner and oopses.
Prevent this by checking pi_state->owner in the unlock path. If
pi_state->owner is not current we know that user space manipulated the
futex value. Ignore the mess and return -EINVAL.
This catches the above case and also the case where a task hijacks the
futex by setting the tid value and then tries to unlock it.
Reported-by: Jermome Marchand <jmarchan@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
This fixes a futex key reference count bug in futex_lock_pi(),
where a key's reference count is incremented twice but decremented
only once, causing the backing object to not be released.
If the futex is created in a temporary file in an ext3 file system,
this bug causes the file's inode to become an "undead" orphan,
which causes an oops from a BUG_ON() in ext3_put_super() when the
file system is unmounted. glibc's test suite is known to trigger this,
see <http://bugzilla.kernel.org/show_bug.cgi?id=14256>.
The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's
38d47c1b70 "[PATCH] futex: rely on
get_user_pages() for shared futexes". That commit made get_futex_key()
also increment the reference count of the futex key, and updated its
callers to decrement the key's reference count before returning.
Unfortunately the normal exit path in futex_lock_pi() wasn't corrected:
the reference count is incremented by get_futex_key() and queue_lock(),
but the normal exit path only decrements once, via unqueue_me_pi().
The fix is to put_futex_key() after unqueue_me_pi(), since 2.6.31
this is easily done by 'goto out_put_key' rather than 'goto out'.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
In cgroup_create(), if alloc_css_id() returns failure, the errno is not
propagated to userspace, so mkdir will fail silently.
To trigger this bug, we mount blkio (or memory subsystem), and create more
then 65534 cgroups. (The number of cgroups is limited to 65535 if a
subsystem has use_id == 1)
# mount -t cgroup -o blkio xxx /mnt
# for ((i = 0; i < 65534; i++)); do mkdir /mnt/$i; done
# mkdir /mnt/65534
(should return ENOSPC)
#
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kfifo kernel-doc warnings:
Warning(kernel/kfifo.c:361): No description found for parameter 'total'
Warning(kernel/kfifo.c:402): bad line: @ @lenout: pointer to output variable with copied data
Warning(kernel/kfifo.c:412): No description found for parameter 'lenout'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Free memory allocated using kmem_cache_zalloc using kmem_cache_free rather
than kfree.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x,E,c;
@@
x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
... when != x = E
when != &x
?-kfree(x)
+kmem_cache_free(c,x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Steve Dickson <steved@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Morris <jmorris@namei.org>
When we cat <debugfs>/tracing/stack_trace, we may cause circular lock:
sys_read()
t_start()
arch_spin_lock(&max_stack_lock);
t_show()
seq_printf(), vsnprintf() .... /* they are all trace-able,
when they are traced, max_stack_lock may be required again. */
The following script can trigger this circular dead lock very easy:
#!/bin/bash
echo 1 > /proc/sys/kernel/stack_tracer_enabled
mount -t debugfs xxx /mnt > /dev/null 2>&1
(
# make check_stack() zealous to require max_stack_lock
for ((; ;))
{
echo 1 > /mnt/tracing/stack_max_size
}
) &
for ((; ;))
{
cat /mnt/tracing/stack_trace > /dev/null
}
To fix this bug, we increase the percpu trace_active before
require the lock.
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B67D4F9.9080905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit f492e12ef0 ("sched: Remove
load_balance_newidle()") removed the only user of this function,
so remove it too.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1265019219.24455.128.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
No change in functionality.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1264938810-4173-1-git-send-email-akinobu.mita@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use __weak instead of __attribute__((weak)).
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
It's based on walk_system_ram_range(), for archs that don't have
their own page_is_ram().
The static verions in MIPS and SCORE are also made global.
v4: prefer plain 1 instead of PAGE_IS_RAM (H. Peter Anvin)
v3: add comment (KAMEZAWA Hiroyuki)
"AFAIK, this "System RAM" information has been used for kdump to
grab valid memory area and seems good for the kernel itself."
v2: add PAGE_IS_RAM macro (Américo Wang)
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: linux-mips@linux-mips.org
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
LKML-Reference: <20100122081619.GA6431@localhost>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger
x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint API
hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails.
perf: Ignore perf.data.old
perf report: Fix segmentation fault when running with '-g none'
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Correct printk whitespace in warning from cpu down task check
sched: Fix incorrect sanity check
sched: Fix fork vs hotplug vs cpuset namespaces
When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, sched_clock() gets
the time from hardware such as the TSC on x86. In this
configuration kgdb will report a softlock warning message on
resuming or detaching from a debug session.
Sequence of events in the problem case:
1) "cpu sched clock" and "hardware time" are at 100 sec prior
to a call to kgdb_handle_exception()
2) Debugger waits in kgdb_handle_exception() for 80 sec and on
exit the following is called ... touch_softlockup_watchdog() -->
__raw_get_cpu_var(touch_timestamp) = 0;
3) "cpu sched clock" = 100s (it was not updated, because the
interrupt was disabled in kgdb) but the "hardware time" = 180 sec
4) The first timer interrupt after resuming from
kgdb_handle_exception updates the watchdog from the "cpu sched clock"
update_process_times() { ... run_local_timers() -->
softlockup_tick() --> check (touch_timestamp == 0) (it is "YES"
here, we have set "touch_timestamp = 0" at kgdb) -->
__touch_softlockup_watchdog() ***(A)--> reset "touch_timestamp"
to "get_timestamp()" (Here, the "touch_timestamp" will still be
set to 100s.) ...
scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched
clock" to "hardware time" = 180s) ... }
5) The Second timer interrupt handler appears to have a large
jump and trips the softlockup warning.
update_process_times() { ... run_local_timers() -->
softlockup_tick() --> "cpu sched clock" - "touch_timestamp" =
180s-100s > 60s --> printk "soft lockup error messages" ... }
note: ***(A) reset "touch_timestamp" to
"get_timestamp(this_cpu)"
Why is "touch_timestamp" 100 sec, instead of 180 sec?
When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, the call trace of
get_timestamp() is:
get_timestamp(this_cpu)
-->cpu_clock(this_cpu)
-->sched_clock_cpu(this_cpu)
-->__update_sched_clock(sched_clock_data, now)
The __update_sched_clock() function uses the GTOD tick value to
create a window to normalize the "now" values. So if "now"
value is too big for sched_clock_data, it will be ignored.
The fix is to invoke sched_clock_tick() to update "cpu sched
clock" in order to recover from this state. This is done by
introducing the function touch_softlockup_watchdog_sync(). This
allows kgdb to request that the sched clock is updated when the
watchdog thread runs the first time after a resume from kgdb.
[yong.zhang0@gmail.com: Use per cpu instead of an array]
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Cc: peterz@infradead.org
LKML-Reference: <1264631124-4837-2-git-send-email-jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes the regression in functionality where the
kernel debugger and the perf API do not nicely share hw
breakpoint reservations.
The kernel debugger cannot use any mutex_lock() calls because it
can start the kernel running from an invalid context.
A mutex free version of the reservation API needed to get
created for the kernel debugger to safely update hw breakpoint
reservations.
The possibility for a breakpoint reservation to be concurrently
processed at the time that kgdb interrupts the system is
improbable. Should this corner case occur the end user is
warned, and the kernel debugger will prohibit updating the
hardware breakpoint reservations.
Any time the kernel debugger reserves a hardware breakpoint it
will be a system wide reservation.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: torvalds@linux-foundation.org
LKML-Reference: <1264719883-7285-3-git-send-email-jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the 2.6.33 kernel, the hw_breakpoint API is now used for the
performance event counters. The hw_breakpoint_handler() now
consumes the hw breakpoints that were previously set by kgdb
arch specific code. In order for kgdb to work in conjunction
with this core API change, kgdb must use some of the low level
functions of the hw_breakpoint API to install, uninstall, and
deal with hw breakpoint reservations.
The kgdb core required a change to call kgdb_disable_hw_debug
anytime a slave cpu enters kgdb_wait() in order to keep all the
hw breakpoints in sync as well as to prevent hitting a hw
breakpoint while kgdb is active.
During the architecture specific initialization of kgdb, it will
pre-allocate 4 disabled (struct perf event **) structures. Kgdb
will use these to manage the capabilities for the 4 hw
breakpoint registers, per cpu. Right now the hw_breakpoint API
does not have a way to ask how many breakpoints are available,
on each CPU so it is possible that the install of a breakpoint
might fail when kgdb restores the system to the run state. The
intent of this patch is to first get the basic functionality of
hw breakpoints working and leave it to the person debugging the
kernel to understand what hw breakpoints are in use and what
restrictions have been imposed as a result. Breakpoint
constraints will be dealt with in a future patch.
While atomic, the x86 specific kgdb code will call
arch_uninstall_hw_breakpoint() and arch_install_hw_breakpoint()
to manage the cpu specific hw breakpoints.
The net result of these changes allow kgdb to use the same pool
of hw_breakpoints that are used by the perf event API, but
neither knows about future reservations for the available hw
breakpoint slots.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: torvalds@linux-foundation.org
LKML-Reference: <1264719883-7285-2-git-send-email-jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ntp.c doesn't need to access timekeeping internals directly, so change
xtime references to use the get_seconds() timekeeping interface.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: richard@rsk.demon.co.uk
LKML-Reference: <1264738844-21935-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make time_esterror and time_maxerror static as no one uses them
outside of ntp.c
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: richard@rsk.demon.co.uk
LKML-Reference: <1264719761.3437.47.camel@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
One problem with frequency driven counters is that we cannot
predict the rate at which they trigger, therefore we have to
start them at period=1, this causes a ramp up effect. However,
if we fail to propagate the stable state on fork each new child
will have to ramp up again. This can lead to significant
artifacts in sample data.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: eranian@google.com
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1264752266.4283.2121.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The return values of the kprobe's tracing functions are meaningless,
lets remove these.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <4B60E9A3.2040505@cn.fujitsu.com>
[fweisbec@gmail: whitespace fixes, drop useless void returns in end
of functions]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Introduce ftrace_perf_buf_prepare() and ftrace_perf_buf_submit() to
gather the common code that operates on raw events sampling buffer.
This cleans up redundant code between regular trace events, syscall
events and kprobe events.
Changelog v1->v2:
- Rename function name as per Masami and Frederic's suggestion
- Add __kprobes for ftrace_perf_buf_prepare() and make
ftrace_perf_buf_submit() inline as per Masami's suggestion
- Export ftrace_perf_buf_prepare since modules will use it
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <4B60E92D.9000808@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
In the function graph tracer, a calling function is to be traced
only when it is enabled through the set_graph_function file,
or when it is nested in an enabled function.
Current code uses TSK_TRACE_FL_GRAPH to test whether it is nested
or not. Looking at the code, we can get this:
(trace->depth > 0) <==> (TSK_TRACE_FL_GRAPH is set)
trace->depth is more explicit to tell that it is nested.
So we use trace->depth directly and simplify the code.
No functionality is changed.
TSK_TRACE_FL_GRAPH is not removed yet, it is left for future usage.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B4DB0B6.7040607@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
On a given architecture, when hardware breakpoint registration fails
due to un-supported access type (read/write/execute), we lose the bp
slot since register_perf_hw_breakpoint() does not release the bp slot
on failure.
Hence, any subsequent hardware breakpoint registration starts failing
with 'no space left on device' error.
This patch introduces error handling in register_perf_hw_breakpoint()
function and releases bp slot on error.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
LKML-Reference: <20100121125516.GA32521@in.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Due to an incorrect line break the output currently contains tabs.
Also remove trailing space.
The actual output that logcheck sent me looked like this:
Task events/1 (pid = 10) is on cpu 1^I^I^I^I(state = 1, flags = 84208040)
After this patch it becomes:
Task events/1 (pid = 10) is on cpu 1 (state = 1, flags = 84208040)
Signed-off-by: Frans Pop <elendilplanet.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <201001251456.34996.elendil@planet.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We moved to migrate on wakeup, which means that sleeping tasks could
still be present on offline cpus. Amend the check to only test running
tasks.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There was a bug in the old period code that caused intel_pmu_enable_all()
or native_write_msr_safe() to show up quite high in the profiles.
In staring at that code it made my head hurt, so I rewrote it in a
hopefully simpler fashion. Its now fully symetric between tick and
overflow driven adjustments and uses less data to boot.
The only complication is that it basically wants to do a u128 division.
The code approximates that in a rather simple truncate until it fits
fashion, taking care to balance the terms while truncating.
This version does not generate that sampling artefact.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Lockdep has found the real bug, but the output doesn't look right to me:
> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 2.6.33-rc5 #77
> ---------------------------------------------------------
> emacs/1609 just changed the state of lock:
> (&(&tty->ctrl_lock)->rlock){+.....}, at: [<ffffffff8127c648>] tty_fasync+0xe8/0x190
> but this lock took another, HARDIRQ-unsafe lock in the past:
> (&(&sighand->siglock)->rlock){-.....}
"HARDIRQ-unsafe" and "this lock took another" looks wrong, afaics.
> ... key at: [<ffffffff81c054a4>] __key.46539+0x0/0x8
> ... acquired at:
> [<ffffffff81089af6>] __lock_acquire+0x1056/0x15a0
> [<ffffffff8108a0df>] lock_acquire+0x9f/0x120
> [<ffffffff81423012>] _raw_spin_lock_irqsave+0x52/0x90
> [<ffffffff8127c1be>] __proc_set_tty+0x3e/0x150
> [<ffffffff8127e01d>] tty_open+0x51d/0x5e0
The stack-trace shows that this lock (ctrl_lock) was taken under
->siglock (which is hopefully irq-safe).
This is a clear typo in check_usage_backwards() where we tell the print a
fancy routine we're forwards.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100126181641.GA10460@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Update the graph tracer examples to cover the new frame pointer semantics
(in terms of passing it along). Move the HAVE_FUNCTION_GRAPH_FP_TEST docs
out of the Kconfig, into the right place, and expand on the details.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
LKML-Reference: <1264165967-18938-1-git-send-email-vapier@gentoo.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the iterator comes to an empty page for some reason, or if
the page is emptied by a consuming read. The iterator code currently
does not check if the iterator is pass the contents, and may
return a false entry.
This patch adds a check to the ring buffer iterator to test if the
current page has been completely read and sets the iterator to the
next page if necessary.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Usually reads of the ring buffer is performed by a single task.
There are two types of reads from the ring buffer.
One is a consuming read which will consume the entry that was read
and the next read will be the entry that follows.
The other is an iterator that will let the user read the contents of
the ring buffer without modifying it. When an iterator is allocated,
writes to the ring buffer are disabled to protect the iterator.
The problem exists when consuming reads happen while an iterator is
allocated. Specifically, the kind of read that swaps out an entire
page (used by splice) and replaces it with a new read. If the iterator
is on the page that is swapped out, then the next read may read
from this swapped out page and return garbage.
This patch adds a check when reading the iterator to make sure that
the iterator contents are still valid. If a consuming read has taken
place, the iterator is reset.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
commit 0f8e8ef7 (clocksource: Simplify clocksource watchdog resume
logic) introduced a potential kgdb dead lock. When the kernel is
stopped by kgdb inside code which holds watchdog_lock then kgdb dead
locks in clocksource_resume_watchdog().
clocksource_resume_watchdog() is called from kbdg via
clocksource_touch_watchdog() to avoid that the clock source watchdog
marks TSC unstable after the kernel has been stopped.
Solve this by replacing spin_lock with a spin_trylock and just return
in case the lock is held. Not resetting the watchdog might result in
TSC becoming marked unstable, but that's an acceptable penalty for
using kgdb.
The timekeeping is anyway easily screwed up by kgdb when the system
uses either jiffies or a clock source which wraps in short intervals
(e.g. pm_timer wraps about every 4.6s), so we really do not have to
worry about that occasional TSC marked unstable side effect.
The second caller of clocksource_resume_watchdog() is
clocksource_resume(). The trylock is safe here as well because the
system is UP at this point, interrupts are disabled and nothing else
can hold watchdog_lock().
Reported-by: Jason Wessel <jason.wessel@windriver.com>
LKML-Reference: <1264480000-6997-4-git-send-email-jason.wessel@windriver.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the contents of the ftrace ring buffer gets corrupted and the trace
file is read, it could create a kernel oops (usualy just killing the user
task thread). This is caused by the checking of the pid in the buffer.
If the pid is negative, it still references the cmdline cache array,
which could point to an invalid address.
The simple fix is to test for negative PIDs.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clockevent: Don't remove broadcast device when cpu is dead
* git://git.infradead.org/~dwmw2/mtd-2.6.33:
mtd: tests: fix read, speed and stress tests on NOR flash
mtd: Really add ARM pismo support
kmsg_dump: Dump on crash_kexec as well
rtmutex_set_prio() is used to implement priority inheritance for
futexes. When a task is deboosted it gets enqueued at the tail of its
RT priority list. This is violating the POSIX scheduling semantics:
rt priority list X contains two runnable tasks A and B
task A runs with priority X and holds mutex M
task C preempts A and is blocked on mutex M
-> task A is boosted to priority of task C (Y)
task A unlocks the mutex M and deboosts itself
-> A is dequeued from rt priority list Y
-> A is enqueued to the tail of rt priority list X
task C schedules away
task B runs
This is wrong as task A did not schedule away and therefor violates
the POSIX scheduling semantics.
Enqueue the task to the head of the priority list instead.
Reported-by: Mathias Weber <mathias.weber.mw1@roche.com>
Reported-by: Carsten Emde <cbe@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Tested-by: Mathias Weber <mathias.weber.mw1@roche.com>
LKML-Reference: <20100120171629.809074113@linutronix.de>
The ability of enqueueing a task to the head of a SCHED_FIFO priority
list is required to fix some violations of POSIX scheduling policy.
Implement the functionality in sched_rt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Tested-by: Mathias Weber <mathias.weber.mw1@roche.com>
LKML-Reference: <20100120171629.772169931@linutronix.de>
The ability of enqueueing a task to the head of a SCHED_FIFO priority
list is required to fix some violations of POSIX scheduling policy.
Extend the related functions with a "head" argument.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Tested-by: Mathias Weber <mathias.weber.mw1@roche.com>
LKML-Reference: <20100120171629.734886007@linutronix.de>
There are a number of issues:
1) TASK_WAKING vs cgroup_clone (cpusets)
copy_process():
sched_fork()
child->state = TASK_WAKING; /* waiting for wake_up_new_task() */
if (current->nsproxy != p->nsproxy)
ns_cgroup_clone()
cgroup_clone()
mutex_lock(inode->i_mutex)
mutex_lock(cgroup_mutex)
cgroup_attach_task()
ss->can_attach()
ss->attach() [ -> cpuset_attach() ]
cpuset_attach_task()
set_cpus_allowed_ptr();
while (child->state == TASK_WAKING)
cpu_relax();
will deadlock the system.
2) cgroup_clone (cpusets) vs copy_process
So even if the above would work we still have:
copy_process():
if (current->nsproxy != p->nsproxy)
ns_cgroup_clone()
cgroup_clone()
mutex_lock(inode->i_mutex)
mutex_lock(cgroup_mutex)
cgroup_attach_task()
ss->can_attach()
ss->attach() [ -> cpuset_attach() ]
cpuset_attach_task()
set_cpus_allowed_ptr();
...
p->cpus_allowed = current->cpus_allowed
over-writing the modified cpus_allowed.
3) fork() vs hotplug
if we unplug the child's cpu after the sanity check when the child
gets attached to the task_list but before wake_up_new_task() shit
will meet with fan.
Solve all these issues by moving fork cpu selection into
wake_up_new_task().
Reported-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1264106190.4283.1314.camel@laptop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: x86: Add support for the ANY bit
perf: Change the is_software_event() definition
perf: Honour event state for aux stream data
perf: Fix perf_event_do_pending() fallback callsite
perf kmem: Print usage help for unknown commands
perf kmem: Increase "Hit" column length
hw-breakpoints, perf: Fix broken mmiotrace due to dr6 by reference change
perf timechart: Use tid not pid for COMM change
Anton reported that perf record kept receiving events even after calling
ioctl(PERF_EVENT_IOC_DISABLE). It turns out that FORK,COMM and MMAP
events didn't respect the disabled state and kept flowing in.
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Anton Blanchard <anton@samba.org>
LKML-Reference: <1263459187.4244.265.camel@laptop>
CC: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul questioned the context in which we should call
perf_event_do_pending(). After looking at that I found that it should be
called from IRQ context these days, however the fallback call-site is
placed in softirq context. Ammend this by placing the callback in the IRQ
timer path.
Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1263374859.4244.192.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the USER_SCHED feature. It has been scheduled to be removed in
2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2
Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1263990378.24844.3.camel@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We want to update the sched_group_powers when balance_cpu == this_cpu.
Currently the group powers are updated only if the balance_cpu is the
first CPU in the local group. But balance_cpu = this_cpu could also be
the first idle cpu in the group. Hence fix the place where the group
powers are updated.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1264017764.5717.127.camel@jschopp-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since all load_balance() callers will have !NULL balance parameters we
can now assume so and remove a few checks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The two functions: load_balance{,_newidle}() are very similar, with the
following differences:
- rq->lock usage
- sb->balance_interval updates
- *balance check
So remove the load_balance_newidle() call with load_balance(.idle =
CPU_NEWLY_IDLE), explicitly unlock the rq->lock before calling (would be
done by double_lock_balance() anyway), and ignore the other differences
for now.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
load_balance() and load_balance_newidle() look remarkably similar, one
key point they differ in is the condition on when to active balance.
So split out that logic into a separate function.
One side effect is that previously load_balance_newidle() used to fail
and return -1 under these conditions, whereas now it doesn't. I've not
yet fully figured out the whole -1 return case for either
load_balance{,_newidle}().
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since load-balancing can hold rq->locks for quite a long while, allow
breaking out early when there is lock contention.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move code around to get rid of fwd declarations.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Again, since we only iterate the fair class, remove the abstraction.
Since this is the last user of the rq_iterator, remove all that too.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since we only ever iterate the fair class, do away with this abstraction.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Take out the sched_class methods for load-balancing.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Straight fwd code movement.
Since non of the load-balance abstractions are used anymore, do away with
them and simplify the code some. In preparation move the code around.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Assume A->B schedule is processing, if B have acquired BKL before and it
need reschedule this time. Then on B's context, it will go to
need_resched_nonpreemptible for reschedule. But at this time, prev and
switch_count are related to A. It's wrong and will lead to incorrect
scheduler statistics.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <2674af741001102238w7b0ddcadref00d345e2181d11@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
enabled, leading to many cache misses on large machines as we traverse
looking for an idle shared cache to wake to. Change the enabler of
select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
sibling domain level.
Reported-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1262612696.15495.15.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Marc reported that the BUG_ON in clockevents_notify() triggers on his
system. This happens because the kernel tries to remove an active
clock event device (used for broadcasting) from the device list.
The handling of devices which can be used as per cpu device and as a
global broadcast device is suboptimal.
The simplest solution for now (and for stable) is to check whether the
device is used as global broadcast device, but this needs to be
revisited.
[ tglx: restored the cpuweight check and massaged the changelog ]
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
LKML-Reference: <1262834564-13033-1-git-send-email-dfeng@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
The smp ipi data is passed around and given write access by
other cpus and should be separated from per-cpu data consumed by
this cpu.
Looking for hot lines, I saw call_function_data shared with
tick_cpu_sched.
Signed-off-by: Milton Miller <miltonm@bga.com>
Acked-by: Anton Blanchard <anton@samba.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: : Nick Piggin <npiggin@suse.de>
LKML-Reference: <20100118020051.GR12666@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a task gets scheduled in. We don't touch the cpu bound events
so the priority order becomes:
cpu pinned, cpu flexible, task pinned, task flexible.
So schedule out cpu flexibles when a new task context gets in
and correctly order the groups to schedule in:
task pinned, cpu flexible, task flexible.
Cpu pinned groups don't need to be touched at this time.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
We don't need to schedule in/out pinned events on task tick,
now that pinned and flexible groups can be scheduled separately.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Tune the scheduling helpers so that we can choose to schedule either
pinned and/or flexible groups from a context.
And while at it, refactor a bit the naming of these helpers to make
these more consistent and flexible.
There is no (intended) change in scheduling behaviour in this
patch.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
__perf_event_sched_out doesn't need to be globally available, make
it static.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Update kprobe tracing self test for new syntax (it supports
deleting individual probes, and drops $argN support)
and behavior change (new probes are disabled in default).
This selftest includes the following checks:
- Adding function-entry probe and return probe with arguments.
- Enabling these probes.
- Deleting it individually.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100114051211.7814.29436.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kernel/sched: don't expose local functions
The get_rr_interval_* functions are all class methods of
struct sched_class. They are not exported so make them
static.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <201001132021.53253.hartleys@visionengravers.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Each time we save a function entry from the function graph
tracer, we check if the trace array is set, which is wasteful
because it is set anyway before we start the tracer. All we need
is to ensure we have good read and write orderings. When we set
the trace array, we just need to guarantee it to be visible
before starting tracing.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
LKML-Reference: <1263453795-7496-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing/filters: Add comment for match callbacks
tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING
tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching
lib: Introduce strnstr()
tracing/filters: Fix MATCH_END_ONLY filter matching
tracing/filters: Fix MATCH_FRONT_ONLY filter matching
ftrace: Fix MATCH_END_ONLY function filter
tracing/x86: Derive arch from bits argument in recordmcount.pl
ring-buffer: Add rb_list_head() wrapper around new reader page next field
ring-buffer: Wrap a list.next reference with rb_list_head()
The change in acpi_cpufreq to use smp_call_function_any causes a warning
when it is called since the function erroneously passes the cpu id to
cpumask_of_node rather than the node that the cpu is on. Fix this.
cpumask_of_node(3): node > nr_node_ids(1)
Pid: 1, comm: swapper Not tainted 2.6.33-rc3-00097-g2c1f189 #223
Call Trace:
[<ffffffff81028bb3>] cpumask_of_node+0x23/0x58
[<ffffffff81061f51>] smp_call_function_any+0x65/0xfa
[<ffffffff810160d1>] ? do_drv_read+0x0/0x2f
[<ffffffff81015fba>] get_cur_val+0xb0/0x102
[<ffffffff81016080>] get_cur_freq_on_cpu+0x74/0xc5
[<ffffffff810168a7>] acpi_cpufreq_cpu_init+0x417/0x515
[<ffffffff81562ce9>] ? __down_write+0xb/0xd
[<ffffffff8148055e>] cpufreq_add_dev+0x278/0x922
Signed-off-by: David John <davidjon@xenontk.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On my first try using them I missed that the fifos need to be power of
two, resulting in a runtime bug. Document that requirement everywhere
(and fix one grammar bug)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some upcoming code it's useful to peek into a FIFO without permanentely
removing data. This patch implements a new kfifo_out_peek() to do this.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now for kfifo_*_user it's not easily possible to distingush between
a user copy failing and the FIFO not containing enough data. The problem
is that both conditions are multiplexed into the same return code.
Avoid this by moving the "copy length" into a separate output parameter
and only return 0/-EFAULT in the main return value.
I didn't fully adapt the weird "record" variants, those seem
to be unused anyways and were rather messy (should they be just removed?)
I would appreciate some double checking if I did all the conversions
correctly.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pointers to user buffers are currently unsigned char *, which requires
a lot of casting in the caller for any non-char typed buffers. Use void *
instead.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before scheduling an event group, we first check if a group can go
on. We first check if the group is made of software only events
first, in which case it is enough to know if the group can be
scheduled in.
For that purpose, we iterate through the whole group, which is
wasteful as we could do this check when we add/delete an event to
a group.
So we create a group_flags field in perf event that can host
characteristics from a group of events, starting with a first
PERF_GROUP_SOFTWARE flag that reduces the check on the fast path.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
This is more proper that doing it through a list_for_each_entry()
that breaks after the first entry.
v2: Don't rotate pinned groups as its not needed to time share
them.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Split-up struct perf_event_context::group_list into pinned_groups
and flexible_groups (non-pinned).
This first appears to be useless as it duplicates various loops around
the group list handlings.
But it scales better in the fast-path in perf_sched_in(). We don't
anymore iterate twice through the entire list to separate pinned and
non-pinned scheduling. Instead we interate through two distinct lists.
The another desired effect is that it makes easier to define distinct
scheduling rules on both.
Changes in v2:
- Respectively rename pinned_grp_list and
volatile_grp_list into pinned_groups and flexible_groups as per
Ingo suggestion.
- Various cleanups
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
We should be clear on 2 things:
- the length parameter of a match callback includes
tailing '\0'.
- the string to be searched might not be NULL-terminated.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8770.7000608@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
MATCH_FULL matching for PTR_STRING is not working correctly:
# echo 'func == vt' > events/bkl/lock_kernel/filter
# echo 1 > events/bkl/lock_kernel/enable
...
# cat trace
Xorg-1484 [000] 1973.392586: lock_kernel: ... func=vt_ioctl()
gpm-1402 [001] 1974.027740: lock_kernel: ... func=vt_ioctl()
We should pass to regex.match(..., len) the length (including '\0')
of the source string instead of the length of the pattern string.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8763.5070707@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The @str might not be NULL-terminated if it's of type
DYN_STRING or STATIC_STRING, so we should use strnstr()
instead of strstr().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8753.2000102@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For '*foo' pattern, we should allow any string ending with
'foo', but event filtering incorrectly disallows strings
like bar_foo_foo:
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8735.6070604@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
MATCH_FRONT_ONLY actually is a full matching:
# ./perf record -R -f -a -e lock:lock_acquire \
--filter 'name ~rcu_*' sleep 1
# ./perf trace
(no output)
We should pass the length of the pattern string to strncmp().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8721.5090301@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
perf_event_task_sched_in() expects interrupts to be disabled,
but on architectures with __ARCH_WANT_INTERRUPTS_ON_CTXSW
defined, this isn't true. If this is defined, disable irqs
around the call in finish_task_switch().
Signed-off-by: Jamie Iles <jamie.iles@picochip.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
LKML-Reference: <1262964453-27370-1-git-send-email-jamie.iles@picochip.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Drop function argument access syntax, because the function
arguments depend on not only architecture but also
compile-options and function API. And now, we have perf-probe
for finding register/memory assigned to each argument.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: linuxppc-dev@ozlabs.org
LKML-Reference: <20100105224648.19431.52309.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, futexes have two problem:
A) The current futex code doesn't handle private file mappings properly.
get_futex_key() uses PageAnon() to distinguish file and
anon, which can cause the following bad scenario:
1) thread-A call futex(private-mapping, FUTEX_WAIT), it
sleeps on file mapping object.
2) thread-B writes a variable and it makes it cow.
3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
wakes up blocked thread on the anonymous page. (but it's nothing)
B) Current futex code doesn't handle zero page properly.
Read mode get_user_pages() can return zero page, but current
futex code doesn't handle it at all. Then, zero page makes
infinite loop internally.
The solution is to use write mode get_user_page() always for
page lookup. It prevents the lookup of both file page of private
mappings and zero page.
Performance concerns:
Probaly very little, because glibc always initialize variables
for futex before to call futex(). It means glibc users never see
the overhead of this patch.
Compatibility concerns:
This patch has few compatibility issues. After this patch,
FUTEX_WAIT require writable access to futex variables (read-only
mappings makes EFAULT). But practically it's not a problem,
glibc always initalizes variables for futexes explicitly - nobody
uses read-only mappings.
Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Cc: <stable@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Ulrich Drepper <drepper@gmail.com>
LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Previously, each level of the rcu_node hierarchy had the same
rather unimaginative name: "&rcu_node_class[i]". This makes
lockdep diagnostics involving these lockdep classes less helpful
than would be nice. This patch fixes this by giving each level
of the rcu_node hierarchy a distinct name: "rcu_node_level_0",
"rcu_node_level_1", and so on. This version of the patch
includes improved diagnostics suggested by Josh Triplett and
Peter Zijlstra.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626498421830-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The rcu_process_dyntick() function checks twice for the end of
the current grace period. However, it holds the current
rcu_node structure's ->lock field throughout, and doesn't get to
the second call to rcu_gp_in_progress() unless there is at least
one CPU corresponding to this rcu_node structure that has not
yet checked in for the current grace period, which would prevent
the current grace period from ending. So the current grace
period cannot have ended, and the second check is redundant, so
remove it.
Also, given that this function is used even with !CONFIG_NO_HZ,
its name is quite misleading. Change from rcu_process_dyntick()
to force_qs_rnp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1262646550562-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The comparisons of rsp->gpnum nad rsp->completed in
rcu_process_dyntick() and force_quiescent_state() can be
replaced by the much more clear rcu_gp_in_progress() predicate
function. After doing this, it becomes clear that the
RCU_SAVE_COMPLETED leg of the force_quiescent_state() function's
switch statement is almost completely a no-op. A small change
to the RCU_SAVE_DYNTICK leg renders it a complete no-op, after
which it can be removed. Doing so also eliminates the forcenow
local variable from force_quiescent_state().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465501781-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because a new grace period cannot start while we are executing
within the force_quiescent_state() function's switch statement,
if any test within that switch statement or within any function
called from that switch statement shows that the current grace
period has ended, we can safely re-do that test any time before
we leave the switch statement. This means that we no longer
need a return value from rcu_process_dyntick(), as we can simply
invoke rcu_gp_in_progress() to check whether the old grace
period has finished -- there is no longer any need to worry
about whether or not a new grace period has been started.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465501857-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reduce the number and variety of race conditions by prohibiting
the start of a new grace period while force_quiescent_state() is
active. A new fqs_active flag in the rcu_state structure is used
to trace whether or not force_quiescent_state() is active, and
this new flag is tested by rcu_start_gp(). If the CPU that
closed out the last grace period needs another grace period,
this new grace period may be delayed up to one scheduling-clock
tick, but it will eventually get started.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <126264655052-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When print-fatal-signals is enabled it's possible to dump any memory
reachable by the kernel to the log by simply jumping to that address from
user space.
Or crash the system if there's some hardware with read side effects.
The fatal signals handler will dump 16 bytes at the execution address,
which is fully controlled by ring 3.
In addition when something jumps to a unmapped address there will be up to
16 additional useless page faults, which might be potentially slow (and at
least is not very efficient)
Fortunately this option is off by default and only there on i386.
But fix it by checking for kernel addresses and also stopping when there's
a page fault.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The LTP cgroup test suite generates a "kernel BUG at kernel/cgroup.c:790!"
here in cgroup_diput():
/*
* if we're getting rid of the cgroup, refcount should ensure
* that there are no pidlists left.
*/
BUG_ON(!list_empty(&cgrp->pidlists));
The cgroup pidlist rework in 2.6.32 generates the BUG_ON, which is caused
when pidlist_array_load() calls cgroup_pidlist_find():
(1) if a matching cgroup_pidlist is found, it down_write's the mutex of the
pre-existing cgroup_pidlist, and increments its use_count.
(2) if no matching cgroup_pidlist is found, then a new one is allocated, it
down_write's its mutex, and the use_count is set to 0.
(3) the matching, or new, cgroup_pidlist gets returned back to pidlist_array_load(),
which increments its use_count -- regardless whether new or pre-existing --
and up_write's the mutex.
So if a matching list is ever encountered by cgroup_pidlist_find() during
the life of a cgroup directory, it results in an inflated use_count value,
preventing it from ever getting released by cgroup_release_pid_array().
Then if the directory is subsequently removed, cgroup_diput() hits the
BUG_ON() when it finds that the directory's cgroup is still populated with
a pidlist.
The patch simply removes the use_count increment when a matching pidlist
is found by cgroup_pidlist_find(), because it gets bumped by the calling
pidlist_array_load() function while still protected by the list's mutex.
Signed-off-by: Dave Anderson <anderson@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Paul Menage <menage@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix resource (write-pipe file) leak in call_usermodehelper_pipe().
When call_usermodehelper_exec() fails, write-pipe file is opened and
call_usermodehelper_pipe() just returns an error. Since it is hard for
caller to determine whether the error occured when opening the pipe or
executing the helper, the caller cannot close the pipe by themselves.
I've found this resoruce leak when testing coredump. You can check how
the resource leaks as below;
$ echo "|nocommand" > /proc/sys/kernel/core_pattern
$ ulimit -c unlimited
$ while [ 1 ]; do ./segv; done &> /dev/null &
$ cat /proc/meminfo (<- repeat it)
where segv.c is;
//-----
int main () {
char *p = 0;
*p = 1;
}
//-----
This patch closes write-pipe file if call_usermodehelper_exec() failed.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the very unlikely case happens where the writer moves the head by one
between where the head page is read and where the new reader page
is assigned _and_ the writer then writes and wraps the entire ring buffer
so that the head page is back to what was originally read as the head page,
the page to be swapped will have a corrupted next pointer.
Simple solution is to wrap the assignment of the next pointer with a
rb_list_head().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This reference at the end of rb_get_reader_page() was causing off-by-one
writes to the prev pointer of the page after the reader page when that
page is the head page, and therefore the reader page has the RB_PAGE_HEAD
flag in its list.next pointer. This eventually results in a GPF in a
subsequent call to rb_set_head_page() (usually from rb_get_reader_page())
when that prev pointer is dereferenced. The dereferenced register would
characteristically have an address that appears shifted left by one byte
(eg, ffxxxxxxxxxxxxyy instead of ffffxxxxxxxxxxxx) due to being written at
an address one byte too high.
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1262826727-9090-1-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the ftrace stacktrace option is set, then add the stack dumps to
trace_printk.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
At the beginning, access to the ring buffer was fully serialized
by trace_types_lock. Patch d7350c3f45 gives more freedom to readers,
and patch b04cc6b1f6 adds code to protect trace_pipe and cpu#/trace_pipe.
But actually it is not enough, ring buffer readers are not always
read-only, they may consume data.
This patch makes accesses to trace, trace_pipe, trace_pipe_raw
cpu#/trace, cpu#/trace_pipe and cpu#/trace_pipe_raw serialized.
And removes tracing_reader_cpumask which is used to protect trace_pipe.
Details:
Ring buffer serializes readers, but it is low level protection.
The validity of the events (which returns by ring_buffer_peek() ..etc)
are not protected by ring buffer.
The content of events may become garbage if we allow another process to consume
these events concurrently:
A) the page of the consumed events may become a normal page
(not reader page) in ring buffer, and this page will be rewritten
by the events producer.
B) The page of the consumed events may become a page for splice_read,
and this page will be returned to system.
This patch adds trace_access_lock() and trace_access_unlock() primitives.
These primitives allow multi process access to different cpu ring buffers
concurrently.
These primitives don't distinguish read-only and read-consume access.
Multi read-only access is also serialized.
And we don't use these primitives when we open files,
we only use them when we read files.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B447D52.1050602@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The previous patches added the use of print_fmt string and changes
the trace_define_field() function to also create the fields and
format output for the event format files.
text data bss dec hex filename
5857201 1355780 9336808 16549789 fc879d vmlinux
5884589 1351684 9337896 16574169 fce6d9 vmlinux-orig
The above shows the size of the vmlinux after this patch set
compared to the vmlinux-orig which is before the patch set.
This saves us 27k on text, 1k on bss and adds just 4k of data.
The total savings of 24k in size.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D4D.40604@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The calls ftrace_format_##call() and ftrace_define_fields_##call()
are almost duplicate in functionality. With the addition of the
print_fmt in previous patches, these two functions can be merged
into one.
The trace_define_field() defines the fields and links them into
the struct ftrace_event_call. The previous patches introduced
the print_fmt field and this can now be used with the trace_define_field()
to create the event format file fields and print_fmt field.
The struct ftrace_event_call->fields are used to print the fields
The struct ftrace_event_call->print_fmt is used to print
the "print fmt: XXXXXXXXXXX" line.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D49.5000006@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the clean up of having all events call one specific function,
the syscall event init was changed to call this helper function.
With the new print_fmt updates, the syscalls need to do special
initializations. This patch converts the syscall events to call
its own init function again.
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is part of a patch set that removes the show_format method
in the ftrace event macros.
Add the print_fmt initialization to the kprobe events.
The print_fmt is still not used, but will be in the follow up
patches.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D45.3080100@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is part of a patch set that removes the show_format method
in the ftrace event macros.
Add the print_fmt initialization to the syscall events.
The print_fmt is still not used, but will be in the follow up
patches.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D41.609@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is part of a patch set that removes the show_format method
in the ftrace event macros.
The print_fmt field is added to hold the string that shows
the print_fmt in the event format files. This patch only adds
the field but it is currently not used. Later patches will use
this field to enable us to remove the show_format field
and function.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D3E.2000704@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is part of a patch set that removes the show_format method
in the ftrace event macros.
This patch set requires that all fields are added to the
ftrace_event_call->fields. This patch changes __dynamic_array()
to call trace_define_field() to include fields that use __dynamic_array().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D36.8090100@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 35dead4 "modules: don't export section names of empty sections
via sysfs" changed the set of sections that have attributes, but did
not change the iteration over these attributes in add_notes_attrs().
This can lead to add_notes_attrs() creating attributes with the wrong
names or with null name pointers.
Introduce a sect_empty() function and use it in both add_sect_attrs()
and add_notes_attrs().
Reported-by: Martin Michlmayr <tbm@cyrius.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Tested-by: Martin Michlmayr <tbm@cyrius.com>
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces an interface to process data objects
in parallel. The parallelized objects return after serialization
in the same order as they were before the parallelization.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
ringbuffer*.c are the last users of local.h.
Remove the include from modules.h and add it to ringbuffer files.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
requirements, cache footprint and decreases cycle counts.
The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
not drop the use of __module_ref_addr() which would make per cpu data handling
complicated. this_cpu_xx operations have their own fallback for !SMP.
V8-V9:
- Leave include asm/module.h since ringbuffer.c depends on it. Nothing else
does though. Another patch will deal with that.
- Remove spurious free.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf kmem: Fix statistics typo
kprobes: Fix distinct type warning
perf: Rename perf_event_hw_event in design document
perf tools: Add missing header files to LIB_H Makefile variable
perf record: We should fork only if a program was specified to run
perf diff: Fix usage array, it must end with a NULL entry
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Fix sign fields in ftrace_define_fields_##call()
tracing/syscalls: Fix typo in SYSCALL_DEFINE0
tracing/kprobe: Show sign of fields in trace_kprobe format files
ksym_tracer: Remove trace_stat
ksym_tracer: Fix race when incrementing count
ksym_tracer: Fix to allow writing newline to ksym_trace_filter
ksym_tracer: Fix to make the tracer work
tracing: Kconfig spelling fixes and cleanups
tracing: Fix setting tracer specific options
Documentation: Update ftrace-design.txt
Documentation: Update tracepoint-analysis.txt
Documentation: Update mmiotrace.txt
crash_kexec gets called before kmsg_dump(KMSG_DUMP_OOPS) if
panic_on_oops is set, so the kernel log buffer is not stored
for this case.
This patch adds a KMSG_DUMP_KEXEC dump type which gets called
when crash_kexec() is invoked. To avoid getting double dumps,
the old KMSG_DUMP_PANIC is moved below crash_kexec(). The
mtdoops driver is modified to handle KMSG_DUMP_KEXEC in the
same way as a panic.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Liming found a NULL deref when a task has a perf context but no
counters when it forks.
This can occur in two cases, a race during construction where
the fork hits after installing the context but before the first
counter gets inserted, or more reproducably, a fork after the
last counter is closed (which leaves the context around).
Reported-by: Wang Liming <liming.wang@windriver.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
CC: <stable@kernel.org>
LKML-Reference: <1262185684.7135.222.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add is_signed_type() call to trace_define_field() in ftrace macros.
The code previously just passed in 0 (false), disregarding whether
or not the field was actually a signed type.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D3A.6020007@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The format files of trace_kprobe do not show the sign of the fields.
The other format files show the field signed type of the fields and
this patch makes the trace_kprobe formats consistent with the others.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D27.5040009@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_stat is problematic. Don't use it, use seqfile instead.
This fixes a race that reading the stat file is not protected by
any lock, which can lead to use after free.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B3AF203.40200@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We are under rcu read section but not holding the write lock, so
count++ is not atomic. Use atomic64_t instead.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B3AF1EC.9010608@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It used to work, but now doesn't:
# echo > ksym_filter
bash: echo: write error: Invalid argument
It's caused by d954fbf0ff
("tracing: Fix wrong usage of strstrip in trace_ksyms").
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B3AF1D7.5040400@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ksym tracer doesn't work:
# echo tasklist_lock:rw- > ksym_trace_filter
-bash: echo: write error: No such device
It's because we pass to perf_event_create_kernel_counter()
a cpu number which is not present.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B3AF19E.1010201@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fixes a warning when building with g++:
warning: deprecated conversion from string constant to 'char*'
And the file parameter use is constant, so mark it as such.
Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Cc: peterz@infradead.org
LKML-Reference: <20091223110818.442d848e@marrow.netinsight.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix filename reference (ftrace-implementation.txt ->
ftrace-design.txt).
Fix spelling, punctuation, grammar.
Fix help text indentation and line lengths to reduce need for
horizontal scrolling or larger window sizes.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091221120117.3fb49cdc.randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Quoted from Ingo:
| This reminds me - i think we should eliminate CONFIG_EVENT_PROFILE -
| it's an unnecessary Kconfig complication. If both PERF_EVENTS and
| EVENT_TRACING is enabled we should expose generic tracepoints.
|
| Nor is it limited to event 'profiling', so it has become a misnomer as
| well.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B2F1557.2050705@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Every time I see this:
kernel/kprobes.c: In function 'register_kretprobe':
kernel/kprobes.c:1038: warning: comparison of distinct pointer types lacks a cast
I'm wondering if something changed in common code and we need to
do something for s390. Apparently that's not the case.
Let's get rid of this annoying warning.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <20091221120224.GA4471@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since we only ever schedule the local cpu, there is no need to pass the
cpu number to the perf sched hooks.
This micro-optimizes things a bit.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6:
SYSCTL: Add a mutex to the page_alloc zone order sysctl
SYSCTL: Print binary sysctl warnings (nearly) only once
When printing legacy sysctls print the warning message
for each of them only once. This way there is a guarantee
the syslog won't be flooded for any sane program.
The original attempt at this made the tables non const and stored
the flag inline.
Linus suggested using a separate hash table for this, this is based on a
code snippet from him.
The hash implies this is not exact and can sometimes not print a
new sysctl due to a hash collision, but in practice this should not
be a problem
I used a FNV32 hash over the binary string with a 32byte bitmap. This
gives relatively little collisions when all the predefined binary sysctls
are hashed:
size 256
bucket
length number
0: [25]
1: [67]
2: [88]
3: [47]
4: [22]
5: [6]
6: [1]
The worst case is a single collision of 6 hash values.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Effectively reverts 738d2be430.
As demonstrated by Eric, we really need to call __set_task_cpu()
early in the fork() path to properly initialize the various task
state -- specifically the cgroup state through set_task_rq().
[ we could probably fix this by explicitly calling
__set_task_cpu() from sched_fork(), but lets try that for the
next cycle and simply revert to the old behaviour for now. ]
Reported-by: Eric Paris <eparis@redhat.com>
Tested-by: Eric Paris <eparis@redhat.com>,
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: efault@gmx.de
LKML-Reference: <1261492999.4937.36.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add kfifo_in_rec() - puts some record data into the FIFO
Add kfifo_out_rec() - gets some record data from the FIFO
Add kfifo_from_user_rec() - puts some data from user space into the FIFO
Add kfifo_to_user_rec() - gets data from the FIFO and write it to user space
Add kfifo_peek_rec() - gets the size of the next FIFO record field
Add kfifo_skip_rec() - skip the next fifo out record
Add kfifo_avail_rec() - determinate the number of bytes available in a record FIFO
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add kfifo_reset_out() for save lockless discard the fifo output
Add kfifo_skip() to skip a number of output bytes
Add kfifo_from_user() to copy user space data into the fifo
Add kfifo_to_user() to copy fifo data to user space
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rename kfifo_put... into kfifo_in... to prevent miss use of old non in
kernel-tree drivers
ditto for kfifo_get... -> kfifo_out...
Improve the prototypes of kfifo_in and kfifo_out to make the kerneldoc
annotations more readable.
Add mini "howto porting to the new API" in kfifo.h
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the pointer to the spinlock out of struct kfifo. Most users in
tree do not actually use a spinlock, so the few exceptions now have to
call kfifo_{get,put}_locked, which takes an extra argument to a
spinlock.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a new generic kernel FIFO implementation.
The current kernel fifo API is not very widely used, because it has to
many constrains. Only 17 files in the current 2.6.31-rc5 used it.
FIFO's are like list's a very basic thing and a kfifo API which handles
the most use case would save a lot of development time and memory
resources.
I think this are the reasons why kfifo is not in use:
- The API is to simple, important functions are missing
- A fifo can be only allocated dynamically
- There is a requirement of a spinlock whether you need it or not
- There is no support for data records inside a fifo
So I decided to extend the kfifo in a more generic way without blowing up
the API to much. The new API has the following benefits:
- Generic usage: For kernel internal use and/or device driver.
- Provide an API for the most use case.
- Slim API: The whole API provides 25 functions.
- Linux style habit.
- DECLARE_KFIFO, DEFINE_KFIFO and INIT_KFIFO Macros
- Direct copy_to_user from the fifo and copy_from_user into the fifo.
- The kfifo itself is an in place member of the using data structure, this save an
indirection access and does not waste the kernel allocator.
- Lockless access: if only one reader and one writer is active on the fifo,
which is the common use case, no additional locking is necessary.
- Remove spinlock - give the user the freedom of choice what kind of locking to use if
one is required.
- Ability to handle records. Three type of records are supported:
- Variable length records between 0-255 bytes, with a record size
field of 1 bytes.
- Variable length records between 0-65535 bytes, with a record size
field of 2 bytes.
- Fixed size records, which no record size field.
- Preserve memory resource.
- Performance!
- Easy to use!
This patch:
Since most users want to have the kfifo as part of another object,
reorganize the code to allow including struct kfifo in another data
structure. This requires changing the kfifo_alloc and kfifo_init
prototypes so that we pass an existing kfifo pointer into them. This
patch changes the implementation and all existing users.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 7bc7d63745, as
requested by John Stultz. Quoting John:
"Petr Titěra reported an issue where he saw odd atime regressions with
2.6.33 where there were a full second worth of nanoseconds in the
nanoseconds field.
He also reviewed the time code and narrowed down the problem: unhandled
overflow of the nanosecond field caused by rounding up the
sub-nanosecond accumulated time.
Details:
* At the end of update_wall_time(), we currently round up the
sub-nanosecond portion of accumulated time when storing it into xtime.
This was added to avoid time inconsistencies caused when the
sub-nanosecond portion was truncated when storing into xtime.
Unfortunately we don't handle the possible second overflow caused by
that rounding.
* Previously the xtime_cache code hid this overflow by normalizing the
xtime value when storing into the xtime_cache.
* We could try to handle the second overflow after the rounding up, but
since this affects the timekeeping's internal state, this would further
complicate the next accumulation cycle, causing small errors in ntp
steering. As much as I'd like to get rid of it, the xtime_cache code is
known to work.
* The correct fix is really to include the sub-nanosecond portion in the
timekeeping accessor function, so we don't need to round up at during
accumulation. This would greatly simplify the accumulation code.
Unfortunately, we can't do this safely until the last three
non-GENERIC_TIME arches (sparc32, arm, cris) are converted (those
patches are in -mm) and we kill off the spots where arches set xtime
directly. This is all 2.6.34 material, so I think reverting the
xtime_cache change is the best approach for now.
Many thanks to Petr for both reporting and finding the issue!"
Reported-by: Petr Titěra <P.Titera@century.cz>
Requested-by: john stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pull ACC_MODE to fs.h; we have several copies all over the place
* nightmarish expression calculating f_mode by f_flags deserves a helper
too (OPEN_FMODE(flags))
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It seems a couple places such as arch/ia64/kernel/perfmon.c and
drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile()
instead of a private pseudo-fs + alloc_file(), if only there were a way
to get a read-only file. So provide this by having anon_inode_getfile()
create a read-only file if we pass O_RDONLY in flags.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function __set_tracer_option() takes as its last parameter a
"neg" value. If set it should negate the value of the option.
The trace_options_write() passed the value written to the file
which is what the new value needs to be set as. But since this
is not the negative, it never sets the value.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The second parameter to alignf() in allocate_resource() must
reflect what new resource is attempted to be allocated, else
functions like pcibios_align_resource() (at least on x86) or
pcmcia_align() can't work correctly.
Commit 1e5ad96790 broke this by
setting the "new" resource until we're about to return success.
To keep the resource untouched when allocate_resource() fails,
a "tmp" resource is introduced.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The hot-unplug kstopmachine usage does a wakeup after
deactivating the cpu, hence we cannot use cpu_active()
here but must rely on the good olde online.
Reported-by: Sachin Sant <sachinp@in.ibm.com>
Reported-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <1261326987.4314.24.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Revert the braindead pr_* crap. (Commit 663997d "sched: Use
pr_fmt() and pr_<level>()")
It's dumb and causes stupid "sched: " strings all over the place.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: Joe Perches <joe@perches.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1261315437.4314.6.camel@laptop>
[ i dont mind the pr_*() patterns that much - but Peter dislikes them with a vengence. ]
[ - v2: remove spurious diffstat from changelog :-/ ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf session: Make events_stats u64 to avoid overflow on 32-bit arches
hw-breakpoints: Fix hardware breakpoints -> perf events dependency
perf events: Dont report side-band events on each cpu for per-task-per-cpu events
perf events, x86/stacktrace: Fix performance/softlockup by providing a special frame pointer-only stack walker
perf events, x86/stacktrace: Make stack walking optional
perf events: Remove unused perf_counter.h header file
perf probe: Check new event name
kprobe-tracer: Check new event/group name
perf probe: Check whether debugfs path is correct
perf probe: Fix libdwarf include path for Debian
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
sched: Fix broken assertion
sched: Assert task state bits at build time
sched: Update task_state_arraypwith new states
sched: Add missing state chars to TASK_STATE_TO_CHAR_STR
sched: Move TASK_STATE_TO_CHAR_STR near the TASK_state bits
sched: Teach might_sleep() about preemptible RCU
sched: Make warning less noisy
sched: Simplify set_task_cpu()
sched: Remove the cfs_rq dependency from set_task_cpu()
sched: Add pre and post wakeup hooks
sched: Move kthread_bind() back to kthread.c
sched: Fix select_task_rq() vs hotplug issues
sched: Fix sched_exec() balancing
sched: Ensure set_task_cpu() is never called on blocked tasks
sched: Use TASK_WAKING for fork wakups
sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE
sched: Fix task_hot() test order
sched: Fix set_cpu_active() in cpu_down()
sched: Mark boot-cpu active before smp_init()
sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
...
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sys: Fix missing rcu protection for __task_cred() access
signals: Fix more rcu assumptions
signal: Fix racy access to __task_cred in kill_pid_info_as_uid()
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: Remove duplicate setting of new_base in __mod_timer()
clockevents: Prevent clockevent_devices list corruption on cpu hotplug
Several leaks in audit_tree didn't get caught by commit
318b6d3d7d, including the leak on normal
exit in case of multiple rules refering to the same chunk.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... aka "Al had badly fscked up when writing that thing and nobody
noticed until Eric had fixed leaks that used to mask the breakage".
The function essentially creates a copy of old array sans one element
and replaces the references to elements of original (they are on cyclic
lists) with those to corresponding elements of new one. After that the
old one is fair game for freeing.
First of all, there's a dumb braino: when we get to list_replace_init we
use indices for wrong arrays - position in new one with the old array
and vice versa.
Another bug is more subtle - termination condition is wrong if the
element to be excluded happens to be the last one. We shouldn't go
until we fill the new array, we should go until we'd finished the old
one. Otherwise the element we are trying to kill will remain on the
cyclic lists...
That crap used to be masked by several leaks, so it was not quite
trivial to hit. Eric had fixed some of those leaks a while ago and the
shit had hit the fan...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
cpumask: rename tsk_cpumask to tsk_cpus_allowed
cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt
cpumask: avoid dereferencing struct cpumask
cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t
cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c
cpumask: avoid deprecated function in mm/slab.c
cpumask: use cpu_online in kernel/perf_event.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support
NOMMU: Optimise away the {dac_,}mmap_min_addr tests
security/min_addr.c: make init_mmap_min_addr() static
keys: PTR_ERR return of wrong pointer in keyctl_get_security()
* 'kmemleak' of git://linux-arm.org/linux-2.6:
kmemleak: fix kconfig for crc32 build error
kmemleak: Reduce the false positives by checking for modified objects
kmemleak: Show the age of an unreferenced object
kmemleak: Release the object lock before calling put_object()
kmemleak: Scan the _ftrace_events section in modules
kmemleak: Simplify the kmemleak_scan_area() function prototype
kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE
Fix kernel-doc warnings in printk.c:
Warning(kernel/printk.c:1422): No description found for parameter 'dumper'
Warning(kernel/printk.c:1422): Excess function parameter 'dump' description in 'kmsg_dump_register'
Warning(kernel/printk.c:1451): No description found for parameter 'dumper'
Warning(kernel/printk.c:1451): Excess function parameter 'dump' description in 'kmsg_dump_unregister'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thanks to Roland who pointed out de_thread() issues.
Currently we add sub-threads to ->real_parent->children list. This buys
nothing but slows down do_wait().
With this patch ->children contains only main threads (group leaders).
The only complication is that forget_original_parent() should iterate over
sub-threads by hand, and de_thread() needs another list_replace() when it
changes ->group_leader.
Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
tasks, we can remove this check (and we can unify do_wait_thread() and
ptrace_do_wait()).
This change can confuse the optimistic search in mm_update_next_owner(),
but this is fixable and minor.
Perhaps badness() and oom_kill_process() should be updated, but they
should be fixed in any case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is a mistake that we used 'proc_dointvec', it should be
'proc_dointvec_minmax', as in the original patch.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-33' of git://repo.or.cz/linux-kbuild: (29 commits)
net: fix for utsrelease.h moving to generated
gen_init_cpio: fixed fwrite warning
kbuild: fix make clean after mismerge
kbuild: generate modules.builtin
genksyms: properly consider EXPORT_UNUSED_SYMBOL{,_GPL}()
score: add asm/asm-offsets.h wrapper
unifdef: update to upstream revision 1.190
kbuild: specify absolute paths for cscope
kbuild: create include/generated in silentoldconfig
scripts/package: deb-pkg: use fakeroot if available
scripts/package: add KBUILD_PKG_ROOTCMD variable
scripts/package: tar-pkg: use tar --owner=root
Kbuild: clean up marker
net: add net_tstamp.h to headers_install
kbuild: move utsrelease.h to include/generated
kbuild: move autoconf.h to include/generated
drop explicit include of autoconf.h
kbuild: move compile.h to include/generated
kbuild: drop include/asm
kbuild: do not check for include/asm-$ARCH
...
Fixed non-conflicting clean merge of modpost.c as per comments from
Stephen Rothwell (modpost.c had grown an include of linux/autoconf.h
that needed to be changed to generated/autoconf.h)
There's a preemption race in the set_task_cpu() debug check in
that when we get preempted after setting task->state we'd still
be on the rq proper, but fail the test.
Check for preempted tasks, since those are always on the RQ.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20091217121830.137155561@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acme noticed that his FORK/MMAP numbers were inflated by about
the same factor as his cpu-count.
This led to the discovery of a few more sites that need to
respect the event->cpu filter.
Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091217121830.215333434@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current print_context_stack helper that does the stack
walking job is good for usual stacktraces as it walks through
all the stack and reports even addresses that look unreliable,
which is nice when we don't have frame pointers for example.
But we have users like perf that only require reliable
stacktraces, and those may want a more adapted stack walker, so
lets make this function a callback in stacktrace_ops that users
can tune for their needs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1261024834-5336-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In practice, it is harmless to voluntarily sleep in a
rcu_read_lock() section if we are running under preempt rcu, but
it is illegal if we build a kernel running non-preemptable rcu.
Currently, might_sleep() doesn't notice sleepable operations
under rcu_read_lock() sections if we are running under
preemptable rcu because preempt_count() is left untouched after
rcu_read_lock() in this case. But we want developers who test
their changes under such config to notice the "sleeping while
atomic" issues.
So we add rcu_read_lock_nesting to prempt_count() in
might_sleep() checks.
[ v2: Handle rcu-tiny ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1260991265-8451-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Check new event/group name is same syntax as a C symbol. In other
words, checking the name is as like as other tracepoint events.
This can prevent user to create an event with useless name (e.g.
foo|bar, foo*bar).
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
LKML-Reference: <20091216222408.14459.68790.stgit@dhcp-100-2-132.bos.redhat.com>
[ v2: minor cleanups ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
struct cpumask will be undefined soon with CONFIG_CPUMASK_OFFSTACK=y,
to avoid them being declared on the stack.
cpumask_bits() does what we want here (of course, this code is crap).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: Thomas Gleixner <tglx@linutronix.de>
Also, we want to check against nr_cpu_ids, not num_possible_cpus().
The latter works, but the correct bounds check is < nr_cpu_ids.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: Thomas Gleixner <tglx@linutronix.de>
new_base is set using per_cpu(tvec_bases, cpu) after selecting the
desired value of cpu immediately below so this line is a unnecessary.
Signed-off-by: Simon Horman <horms@verge.net.au>
LKML-Reference: <20091217001542.GD25317@verge.net.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
skipped by the compiler. We do this as the minimum mmap address doesn't make
any sense in NOMMU mode.
mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
As predicted during code review, the sysctl(2) changes made systems with
old glibc nearly unusable. About every command gives a:
warning: process `ls' used the deprecated sysctl system call with 1.4
warning in the log.
I see this on a SUSE 10.0 system with glibc 2.3.5.
Don't warn for this common case.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (52 commits)
perf record: Use per-task-per-cpu events for inherited events
perf record: Properly synchronize child creation
perf events: Allow per-task-per-cpu counters
perf diff: Percent calcs should use double values
perf diff: Change the default sort order to "dso,symbol"
perf diff: Use perf_session__fprintf_hists just like 'perf record'
perf report: Fix cut'n'paste error recently introduced
perf session: Move perf report specific hits out of perf_session__fprintf_hists
perf tools: Move hist entries printing routines from perf report
perf report: Generalize perf_session__fprintf_hists()
perf symbols: Move symbol filtering to event__preprocess_sample()
perf symbols: Adopt the strlists for dso, comm
perf symbols: Make symbol_conf global
perf probe: Fix to show which probe point is not found
perf probe: Check symbols in symtab/kallsyms
perf probe: Check build-id of vmlinux
perf probe: Reject second attempt of adding same-name event
perf probe: Support event name for --add option
perf probe: Add glob matching support on --del
perf probe: Use strlist__for_each macros in probe-event.c
...
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Fix return of trace_dump_stack()
ksym_tracer: Fix bad cast
tracing/power: Remove two exports
tracing: Change event->profile_count to be int type
tracing: Simplify trace_option_write()
tracing: Remove useless trace option
tracing: Use seq file for trace_clock
tracing: Use seq file for trace_options
function-graph: Allow writing the same val to set_graph_function
ftrace: Call trace_parser_clear() properly
ftrace: Return EINVAL when writing invalid val to set_ftrace_filter
tracing: Move a printk out of ftrace_raw_reg_event_foo()
tracing: Pull up calls to trace_define_common_fields()
tracing: Extract duplicate ftrace_raw_init_event_foo()
ftrace.h: Use common pr_info fmt string
tracing: Add stack trace to irqsoff tracer
tracing: Add trace_dump_stack()
ring-buffer: Move resize integrity check under reader lock
ring-buffer: Use sync sched protection on ring buffer resizing
tracing: Fix wrong usage of strstrip in trace_ksyms
* 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
modpost: fix segfault with short symbol names
module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y
Kbuild: clear marker out of modpost
module: make MODULE_SYMBOL_PREFIX into a CONFIG option
ARM: unexport symbols used to implement floating point emulation
ARM: use unified discard definition in linker script
x86: don't export inline function
sparc64: don't export static inline pci_ functions
* git://git.infradead.org/mtd-2.6: (90 commits)
jffs2: Fix long-standing bug with symlink garbage collection.
mtd: OneNAND: Fix test of unsigned in onenand_otp_walk()
mtd: cfi_cmdset_0002, fix lock imbalance
Revert "mtd: move mxcnd_remove to .exit.text"
mtd: m25p80: add support for Macronix MX25L4005A
kmsg_dump: fix build for CONFIG_PRINTK=n
mtd: nandsim: add support for 4KiB pages
mtd: mtdoops: refactor as a kmsg_dumper
mtd: mtdoops: make record size configurable
mtd: mtdoops: limit the maximum mtd partition size
mtd: mtdoops: keep track of used/unused pages in an array
mtd: mtdoops: several minor cleanups
core: Add kernel message dumper to call on oopses and panics
mtd: add ARM pismo support
mtd: pxa3xx_nand: Fix PIO data transfer
mtd: nand: fix multi-chip suspend problem
mtd: add support for switching old SST chips into QRY mode
mtd: fix M29W800D dev_id and uaddr
mtd: don't use PF_MEMALLOC
mtd: Add bad block table overrides to Davinci NAND driver
...
Fixed up conflicts (mostly trivial) in
drivers/mtd/devices/m25p80.c
drivers/mtd/maps/pcmciamtd.c
drivers/mtd/nand/pxa3xx_nand.c
kernel/printk.c
Rearrange code a bit now that its a simpler function.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.269101883@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to remove the cfs_rq dependency from set_task_cpu() we
need to ensure the task is cfs_rq invariant for all callsites.
The simple approach is to substract cfs_rq->min_vruntime from
se->vruntime on dequeue, and add cfs_rq->min_vruntime on
enqueue.
However, this has the downside of breaking FAIR_SLEEPERS since
we loose the old vruntime as we only maintain the relative
position.
To solve this, we observe that we only migrate runnable tasks,
we do this using deactivate_task(.sleep=0) and
activate_task(.wakeup=0), therefore we can restrain the
min_vruntime invariance to that state.
The only other case is wakeup balancing, since we want to
maintain the old vruntime we cannot make it relative on dequeue,
but since we don't migrate inactive tasks, we can do so right
before we activate it again.
This is where we need the new pre-wakeup hook, we need to call
this while still holding the old rq->lock. We could fold it into
->select_task_rq(), but since that has multiple callsites and
would obfuscate the locking requirements, that seems like a
fudge.
This leaves the fork() case, simply make sure that ->task_fork()
leaves the ->vruntime in a relative state.
This covers all cases where set_task_cpu() gets called, and
ensures it sees a relative vruntime.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.191697025@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As will be apparent in the next patch, we need a pre wakeup hook
for sched_fair task migration, hence rename the post wakeup hook
and one pre wakeup.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.114746117@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since kthread_bind() lost its dependencies on sched.c, move it
back where it came from.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.039524041@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since select_task_rq() is now responsible for guaranteeing
->cpus_allowed and cpu_active_mask, we need to verify this.
select_task_rq_rt() can blindly return
smp_processor_id()/task_cpu() without checking the valid masks,
select_task_rq_fair() can do the same in the rare case that all
SD_flags are disabled.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.961475466@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since we access ->cpus_allowed without holding rq->lock we need
a retry loop to validate the result, this comes for near free
when we merge sched_migrate_task() into sched_exec() since that
already does the needed check.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.884743662@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to clean up the set_task_cpu() rq dependencies we need
to ensure it is never called on blocked tasks because such usage
does not pair with consistent rq->lock usage.
This puts the migration burden on ttwu().
Furthermore we need to close a race against changing
->cpus_allowed, since select_task_rq() runs with only preemption
disabled.
For sched_fork() this is safe because the child isn't in the
tasklist yet, for wakeup we fix this by synchronizing
set_cpus_allowed_ptr() against TASK_WAKING, which leaves
sched_exec to be a problem
This also closes a hole in (6ad4c1888 sched: Fix balance vs
hotplug race) where ->select_task_rq() doesn't validate the
result against the sched_domain/root_domain.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.807938893@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For later convenience use TASK_WAKING for fresh tasks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.732561278@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make sure not to access sched_fair fields before verifying it is
indeed a sched_fair task.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
CC: stable@kernel.org
LKML-Reference: <20091216170517.577998058@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Sachin found cpu hotplug test failures on powerpc, which made
the kernel hang on his POWER box.
The problem is that we fail to re-activate a cpu when a
hot-unplug fails. Fix this by moving the de-activation into
_cpu_down after doing the initial checks.
Remove the synchronize_sched() calls and rely on those implied
by rebuilding the sched domains using the new mask.
Reported-by: Sachin Sant <sachinp@in.ibm.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Tested-by: Sachin Sant <sachinp@in.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.500272612@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to allow for per-task-per-cpu counters, useful for
scalability when profiling task hierarchies, we allow installing
events with event->cpu != -1 in task contexts.
__perf_event_sched_in() already skips events where ->cpu
mis-matches the current cpu, fix up __perf_install_in_context()
and __perf_event_enable() to also respect this filter.
This does lead to vary hard to interpret enabled/running times
for such counters, but I don't see a simple solution for that.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: fweisbec@gmail.com
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091216165904.831451147@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement shrinking the reserved memory for crash kernel, if it is more
than enough.
For example, if you have already reserved 128M, now you just want 100M,
you can do:
# echo $((100*1024*1024)) > /sys/kernel/kexec_crash_size
Note, you can only do this before loading the crash kernel.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It decreases code size by 16 bytes on my gcc 4.4.1 on Core 2:
text data bss dec hex filename
4314 2216 8 6538 198a kernel/pid.o-BEFORE
4298 2216 8 6522 197a kernel/pid.o-AFTER
Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid calling kfree() under pidmap spinlock, calling it afterwards.
Normally kfree() is fast, but sometimes it can be slow, so avoid
calling it under the spinlock if we can do it.
Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the call to do_signal_stop() down, after tracehook call. This makes
->group_stop_count condition visible to tracers before do_signal_stop()
will participate in this group-stop.
Currently the patch has no effect, tracehook_get_signal() always returns 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trivial, s/0/SI_USER/ in collect_signal() for grep.
This is a bit confusing, we don't know the source of this signal.
But we don't care, and "info->si_code = 0" is imho worse.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change send_signal() to use si_fromuser(). From now SEND_SIG_NOINFO
triggers the "from_ancestor_ns" check.
This fixes reparent_thread()->group_send_sig_info(pdeath_signal)
behaviour, before this patch send_signal() does not detect the
cross-namespace case when the child of the dying parent belongs to the
sub-namespace.
This patch can affect the behaviour of send_sig(), kill_pgrp() and
kill_pid() when the caller sends the signal to the sub-namespace with
"priv == 0" but surprisingly all callers seem to use them correctly,
including disassociate_ctty(on_exit).
Except: drivers/staging/comedi/drivers/addi-data/*.c incorrectly use
send_sig(priv => 0). But his is minor and should be fixed anyway.
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No changes in compiled code. The patch adds the new helper, si_fromuser()
and changes check_kill_permission() to use this helper.
The real effect of this patch is that from now we "officially" consider
SEND_SIG_NOINFO signal as "from user-space" signals. This is already true
if we look at the code which uses SEND_SIG_NOINFO, except __send_signal()
has another opinion - see the next patch.
The naming of these special SEND_SIG_XXX siginfo's is really bad
imho. From __send_signal()'s pov they mean
SEND_SIG_NOINFO from user
SEND_SIG_PRIV from kernel
SEND_SIG_FORCED no info
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child
starts with TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from ptraced parent.
This is not right, especially when the new child is not auto-attaced: in
this case it is killed by SIGTRAP.
Change copy_process() to call user_disable_single_step(). Tested on x86.
Test-case:
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <assert.h>
int main(void)
{
int pid, status;
if (!(pid = fork())) {
assert(ptrace(PTRACE_TRACEME) == 0);
kill(getpid(), SIGSTOP);
if (!fork()) {
/* kernel bug: this child will be killed by SIGTRAP */
printf("Hello world\n");
return 43;
}
wait(&status);
return WEXITSTATUS(status);
}
for (;;) {
assert(pid == wait(&status));
if (WIFEXITED(status))
break;
assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0);
}
assert(WEXITSTATUS(status) == 43);
return 0;
}
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In massive parallel enviroment, res_counter can be a performance
bottleneck. One strong techinque to reduce lock contention is reducing
calls by coalescing some amount of calls into one.
Considering charge/uncharge chatacteristic,
- charge is done one by one via demand-paging.
- uncharge is done by
- in chunk at munmap, truncate, exit, execve...
- one by one via vmscan/paging.
It seems we have a chance to coalesce uncharges for improving scalability
at unmap/truncation.
This patch is a for coalescing uncharge. For avoiding scattering memcg's
structure to functions under /mm, this patch adds memcg batch uncharge
information to the task. A reason for per-task batching is for making use
of caller's context information. We do batched uncharge (deleyed
uncharge) when truncation/unmap occurs but do direct uncharge when
uncharge is called by memory reclaim (vmscan.c).
The degree of coalescing depends on callers
- at invalidate/trucate... pagevec size
- at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.)
Then, we'll not coalescing too much.
On x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per a cpu by taskset and see sum of the number of page faults
in 60secs.
[without memcg config]
40156968 page-faults # 0.085 M/sec ( +- 0.046% )
27.67 cache-miss/faults
[root cgroup]
36659599 page-faults # 0.077 M/sec ( +- 0.247% )
31.58 miss/faults
[in a child cgroup]
18444157 page-faults # 0.039 M/sec ( +- 0.133% )
69.96 miss/faults
[child with this patch]
27133719 page-faults # 0.057 M/sec ( +- 0.155% )
47.16 miss/faults
We can see some amounts of improvement.
(root cgroup doesn't affected by this patch)
Another patch for "charge" will follow this and above will be improved more.
Changelog(since 2009/10/02):
- renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes)
- some clean up and commentary/description updates.
- added initialize code to copy_process(). (possible bug fix)
Changelog(old):
- fixed !CONFIG_MEM_CGROUP case.
- rebased onto the latest mmotm + softlimit fix patches.
- unified patch for callers
- added commetns.
- make ->do_batch as bool.
- removed css_get() at el. We don't need it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ktime will overflow from 03:14:07 UTC on Tuesday, 19 January 2038,
ktime_add() in timecompare_update() will overflow a half earlier. As a
result, wrong offset will be gotten, then cause some strange problems.
Signed-off-by: Barry Song <21cnbao@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Patrick Ohly <patrick.ohly@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The miss-alignment of bp_addr created a 32bit hole, causing
different structure packings on 32 and 64 bit machines.
Fix that by moving __reserve_2 into that hole.
Further, remove the useless struct and redundant __bp_reserve
muck.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1260902591.8023.781.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits)
clockevents: Convert to raw_spinlock
clockevents: Make tick_device_lock static
debugobjects: Convert to raw_spinlocks
perf_event: Convert to raw_spinlock
hrtimers: Convert to raw_spinlocks
genirq: Convert irq_desc.lock to raw_spinlock
smp: Convert smplocks to raw_spinlocks
rtmutes: Convert rtmutex.lock to raw_spinlock
sched: Convert pi_lock to raw_spinlock
sched: Convert cpupri lock to raw_spinlock
sched: Convert rt_runtime_lock to raw_spinlock
sched: Convert rq->lock to raw_spinlock
plist: Make plist debugging raw_spinlock aware
bkl: Fixup core_lock fallout
locking: Cleanup the name space completely
locking: Further name space cleanups
alpha: Fix fallout from locking changes
locking: Implement new raw_spinlock
locking: Convert raw_rwlock functions to arch_rwlock
locking: Convert raw_rwlock to arch_rwlock
...
Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.
It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
text data bss dec hex filename
64688 584 592 65864 10148 (TOTALS-BEFORE)
64641 584 592 65817 10119 (TOTALS-AFTER)
Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".
Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
drivers/leds/led-class.c
drivers/leds/ledtrig-timer.c
drivers/video/output.c
@@
expression str;
@@
( // ignore skip_spaces cases
while (*str && isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)
Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel offers with TIOCL_GETKMSGREDIRECT ioctl() the possibility to
redirect the kernel messages to a specific console.
However, since it's not possible to switch to the kernel message console
after a panic(), it would be nice if the kernel would print the panic
message on the current console.
This patch series adds a new interface to access the global kmsg_redirect
variable by a function to be able to use it in code where
CONFIG_VT_CONSOLE is not set (kernel/panic.c).
This patch:
Instead of using and exporting a global value kmsg_redirect, introduce a
function vt_kmsg_redirect() that both can set and return the console where
messages are printed.
Change all users of kmsg_redirect (the VT code itself and kernel/power.c)
to the new interface.
The main advantage is that vt_kmsg_redirect() can also be used when
CONFIG_VT_CONSOLE is not set.
Signed-off-by: Bernhard Walle <bernhard@bwalle.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_each_thread/while_each_thread wrap a block of code that is in this format:
for (...)
do
...
while
If curly braces do not surround the inner loop the following warning is
generated by sparse:
warning: do-while statement is not a compound statement
Fix the warning by adding the braces.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use smp_processor_id() instead of get_cpu() and put_cpu() in
generic_smp_call_function_interrupt(), It's no need to disable preempt,
because we must call generic_smp_call_function_interrupt() with interrupts
disabled.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Engelhardt reported we have this problem:
setting max_map_count to a value large enough results in programs dying at
first try. This is on 2.6.31.6:
15:59 borg:/proc/sys/vm # echo $[1<<31-1] >max_map_count
15:59 borg:/proc/sys/vm # cat max_map_count
1073741824
15:59 borg:/proc/sys/vm # echo $[1<<31] >max_map_count
15:59 borg:/proc/sys/vm # cat max_map_count
Killed
This is because we have a chance to make 'max_map_count' negative. but
it's meaningless. Make it only accept non-negative values.
Reported-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: James Morris <jmorris@namei.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:
* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
See the updated documentation [next patch] for more information
about the implications of this patch.
Examples:
Starting with:
Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0
Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:
sysctl vm.nr_hugepages[_mempolicy]=32
yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8
Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.
Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:
numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
This yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.
Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:
numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
yields:
Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The 8 huge pages freed were balanced over nodes 0 and 1.
[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit d8e180dcd5 "bsdacct: switch
credentials for writing to the accounting file" introduced credential
switching during final acct data collecting. However, uid/gid pair
continued to be collected from current which became credentials of who
created acct file, not who exits.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14676
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: Juho K. Juopperi <jkj@kapsi.fi>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is quite legitimate for CPUs to be numbered sparsely, meaning
that it possible for an online CPU to have a number which is
greater than the total count of possible CPUs.
Currently find_get_context() has a sanity check on the cpu
number where it checks it against num_possible_cpus(). This
test can fail for a legitimate cpu number if the
cpu_possible_mask is sparsely populated.
This fixes the problem by checking the CPU number against
nr_cpumask_bits instead, since that is the appropriate check to
ensure that the cpu number is same to pass to cpu_isset()
subsequently.
Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Michael Neuling <mikey@neuling.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
LKML-Reference: <20091215084032.GA18661@brick.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Relax stable-sched-clock architectures to not save/disable/restore
hardirqs in cpu_clock().
The background is that I was trying to resolve a sparc64 perf
issue when I discovered this problem.
On sparc64 I implement pseudo NMIs by simply running the kernel
at IRQ level 14 when local_irq_disable() is called, this allows
performance counter events to still come in at IRQ level 15.
This doesn't work if any code in an NMI handler does
local_irq_save() or local_irq_disable() since the "disable" will
kick us back to cpu IRQ level 14 thus letting NMIs back in and
we recurse.
The only path which that does that in the perf event IRQ
handling path is the code supporting frequency based events. It
uses cpu_clock().
cpu_clock() simply invokes sched_clock() with IRQs disabled.
And that's a fundamental bug all on it's own, particularly for
the HAVE_UNSTABLE_SCHED_CLOCK case. NMIs can thus get into the
sched_clock() code interrupting the local IRQ disable code
sections of it.
Furthermore, for the not-HAVE_UNSTABLE_SCHED_CLOCK case, the IRQ
disabling done by cpu_clock() is just pure overhead and
completely unnecessary.
So the core problem is that sched_clock() is not NMI safe, but
we are invoking it from NMI contexts in the perf events code
(via cpu_clock()).
A less important issue is the overhead of IRQ disabling when it
isn't necessary in cpu_clock().
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK architectures are not
affected by this patch.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091213.182502.215092085.davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The trace_dump_stack() returned a value for a void function.
Also, added the missing stub for trace_dump_stack() when tracing is
not configured.
Reported-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20091214162713.GA31060@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
powerpc applies relocations to the kcrctab. They're absolute symbols,
but it's not completely unreasonable: other archs may too, but the
relocation is often 0.
http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-November/077972.html
Inspired-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
plists are used with spinlocks and raw_spinlocks. Change the plist
debugging to handle both types.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Make the name space hierarchy of locking functions consistent:
raw_spin* -> _raw_spin* -> __raw_spin*
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
The name space hierarchy for the internal lock functions is now a bit
backwards. raw_spin* functions map to _spin* which use __spin*, while
we would like to have _raw_spin* and __raw_spin*.
_raw_spin* is already used by lock debugging, so rename those funtions
to do_raw_spin* to free up the _raw_spin* name space.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Now that the raw_spin name space is freed up, we can implement
raw_spinlock and the related functions which are used to annotate the
locks which are not converted to sleeping spinlocks in preempt-rt.
A side effect is that only such locks can be used with the low level
lock fsunctions which circumvent lockdep.
For !rt spin_* functions are mapped to the raw_spin* implementations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Name space cleanup. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: linux-arch@vger.kernel.org
Further name space cleanup. No functional change
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: linux-arch@vger.kernel.org
The raw_spin* namespace was taken by lockdep for the architecture
specific implementations. raw_spin_* would be the ideal name space for
the spinlocks which are not converted to sleeping locks in preempt-rt.
Linus suggested to convert the raw_ to arch_ locks and cleanup the
name space instead of using an artifical name like core_spin,
atomic_spin or whatever
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: linux-arch@vger.kernel.org
Separate spin_lock and rw_lock functions. Preempt-RT needs to exclude
the rw_lock functions from being compiled. The reordering allows to do
that with a single #ifdef.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
m68k: rename global variable vmalloc_end to m68k_vmalloc_end
percpu: add missing per_cpu_ptr_to_phys() definition for UP
percpu: Fix kdump failure if booted with percpu_alloc=page
percpu: make misc percpu symbols unique
percpu: make percpu symbols in ia64 unique
percpu: make percpu symbols in powerpc unique
percpu: make percpu symbols in x86 unique
percpu: make percpu symbols in xen unique
percpu: make percpu symbols in cpufreq unique
percpu: make percpu symbols in oprofile unique
percpu: make percpu symbols in tracer unique
percpu: make percpu symbols under kernel/ and mm/ unique
percpu: remove some sparse warnings
percpu: make alloc_percpu() handle array types
vmalloc: fix use of non-existent percpu variable in put_cpu_var()
this_cpu: Use this_cpu_xx in trace_functions_graph.c
this_cpu: Use this_cpu_xx for ftrace
this_cpu: Use this_cpu_xx in nmi handling
this_cpu: Use this_cpu operations in RCU
this_cpu: Use this_cpu ops for VM statistics
...
Fix up trivial (famous last words) global per-cpu naming conflicts in
arch/x86/kvm/svm.c
mm/slab.c
read_lock(&tasklist_lock) does not protect
sys_sched_get_rr_param() against a concurrent update of the
policy or scheduler parameters as do_sched_scheduler() does not
take the tasklist_lock.
The access to task->sched_class->get_rr_interval is protected by
task_rq_lock(task).
Use rcu_read_lock() to protect find_task_by_vpid() and prevent
the task struct from going away.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091209100706.862897167@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tasklist_lock is held read locked to protect the
find_task_by_vpid() call and to prevent the task going away.
sched_setaffinity acquires a task struct ref and drops tasklist
lock right away. The access to the cpus_allowed mask is
protected by rq->lock.
rcu_read_lock() provides the same protection here.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091209100706.789059966@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
read_lock(&tasklist_lock) does not protect
sys_sched_getscheduler and sys_sched_getparam() against a
concurrent update of the policy or scheduler parameters as
do_sched_setscheduler() does not take the tasklist_lock. The
accessed integers can be retrieved w/o locking and are snapshots
anyway.
Using rcu_read_lock() to protect find_task_by_vpid() and prevent
the task struct from going away is not changing the above
situation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091209100706.753790977@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix this warning:
kernel/trace/trace_ksym.c: In function 'ksym_trace_filter_read':
kernel/trace/trace_ksym.c:239: warning: cast to pointer from integer of different size
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: "K.Prasad" <prasad@linux.vnet.ibm.com>
LKML-Reference: <4B1DC578.9020909@cn.fujitsu.com>
[remove the strstrip fix as tglx already fixed that]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
trace_power_start and trace_power_end are used in
arch/x86/kernel/power.c, and this file can't be compiled
as a module, so these two tracepoints don't need to be
exported.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC55F.7060305@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Like total_profile_count, struct ftrace_event_call::profile_count
is protected by event_mutex, so it doesn't need to be atomic_t.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <4B1DC549.5010705@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
The buffer for the output is as small as 64 bytes, so it'll
overflow if we add more clock type. Use seq file instead.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC4FB.5030407@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
# echo 'do_open' > set_graph_function
# echo 'do_open' >> set_graph_function
bash: echo: write error: Invalid argument
Make it valid to write the same value to set_graph_function,
which is consistent with set_ftrace_filter interface.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-reference: <4B1DC4E1.1060303@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
I found a weird behavior:
# echo 'fuse:*' > set_ftrace_filter
bash: echo: write error: Invalid argument
# cat set_ftrace_filter
fuse_dev_fasync
fuse_dev_poll
fuse_copy_do
We should call trace_parser_clear() no matter ftrace_process_regex()
returns 0 or -errno, otherwise we will actually take the unaccepted
records from ftrace_regex_release().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC4D2.3000406@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Currently it doesn't warn user on invald value:
# echo nonexist_symbol > set_ftrace_filter
or:
# echo 'nonexist_symbol:mod:fuse' > set_ftrace_filter
Better make it return failure.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC4BF.2070003@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Move the printk from each ftrace_raw_reg_event_foo() to
its caller ftrace_event_enable_disable(). This avoids each
regfunc trace event callbacks to handle a same error report
that can be carried from the caller.
See how much space this saves:
text data bss dec hex filename
5345151 1961864 7103260 14410275 dbe223 vmlinux.o.old
5331487 1961864 7103260 14396611 dbacc3 vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <4B1DC4AC.802@cn.fujitsu.com>
[start cmdline record before calling regfunc to avoid lost
window of pid to comm resolution]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Call trace_define_common_fields() in event_create_dir() only.
This avoids trace events to handle it from their define_fields
callbacks and shrinks the kernel code size:
text data bss dec hex filename
5346802 1961864 7103260 14411926 dbe896 vmlinux.o.old
5345151 1961864 7103260 14410275 dbe223 vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <4B1DC49C.8000107@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Use a generic trace_event_raw_init() function for all event's raw_init
callbacks (but kprobes) instead of defining the same version for each
of these.
This shrinks the kernel code:
text data bss dec hex filename
5355293 1961928 7103260 14420481 dc0a01 vmlinux.o.old
5346802 1961864 7103260 14411926 dbe896 vmlinux.o
raw_init can't be removed, because ftrace events and kprobe events
use different raw_init callbacks. Though it's possible to totally
remove raw_init, I choose to leave it as it is for now.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <4B1DC48C.7080603@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Alan Stern noticed that all the wakeup side (and atomic) variants of the
completion APIs should be irq safe, but the newly introduced
completion_done() and try_wait_for_completion() aren't. The use of the
irq unsafe variants in IRQ contexts can cause crashes/hangs.
Fix the problem by making them use spin_lock_irqsave() and
spin_lock_irqrestore().
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: pm list <linux-pm@lists.linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Chinner <david@fromorbit.com>
Cc: Lachlan McIlroy <lachlan@sgi.com>
LKML-Reference: <200912130007.30541.rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
itimer: Fix the itimer trace print format
hrtimer: move timer stats helper functions to hrtimer.c
hrtimer: Tune hrtimer_interrupt hang logic
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: Avoid out of bounds array reference in save_trace()
futex: Take mmap_sem for get_user_pages in fault_in_user_writeable
lockstat: Add usage info to Documentation/lockstat.txt
lockstat: Fix min, max times in /proc/lock_stats
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Remove comparing of NULL to va_list in trace_array_vprintk()
tracing: Fix function graph trace_pipe to properly display failed entries
tracing: Add full state to trace_seq
tracing: Buffer the output of seq_file in case of filled buffer
tracing: Only call pipe_close if pipe_close is defined
tracing: Add pipe_close interface
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (57 commits)
x86, perf events: Check if we have APIC enabled
perf_event: Fix variable initialization in other codepaths
perf kmem: Fix unused argument build warning
perf symbols: perf_header__read_build_ids() offset'n'size should be u64
perf symbols: dsos__read_build_ids() should read both user and kernel buildids
perf tools: Align long options which have no short forms
perf kmem: Show usage if no option is specified
sched: Mark sched_clock() as notrace
perf sched: Add max delay time snapshot
perf tools: Correct size given to memset
perf_event: Fix perf_swevent_hrtimer() variable initialization
perf sched: Fix for getting task's execution time
tracing/kprobes: Fix field creation's bad error handling
perf_event: Cleanup for cpu_clock_perf_event_update()
perf_event: Allocate children's perf_event_ctxp at the right time
perf_event: Clean up __perf_event_init_context()
hw-breakpoints: Modify breakpoints without unregistering them
perf probe: Update perf-probe document
perf probe: Support --del option
trace-kprobe: Support delete probe syntax
...
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (58 commits)
tty: split the lock up a bit further
tty: Move the leader test in disassociate
tty: Push the bkl down a bit in the hangup code
tty: Push the lock down further into the ldisc code
tty: push the BKL down into the handlers a bit
tty: moxa: split open lock
tty: moxa: Kill the use of lock_kernel
tty: moxa: Fix modem op locking
tty: moxa: Kill off the throttle method
tty: moxa: Locking clean up
tty: moxa: rework the locking a bit
tty: moxa: Use more tty_port ops
tty: isicom: fix deadlock on shutdown
tty: mxser: Use the new locking rules to fix setserial properly
tty: mxser: use the tty_port_open method
tty: isicom: sort out the board init logic
tty: isicom: switch to the new tty_port_open helper
tty: tty_port: Add a kref object to the tty port
tty: istallion: tty port open/close methods
tty: stallion: Convert to the tty_port_open/close methods
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kgdb: Always process the whole breakpoint list on activate or deactivate
kgdb: continue and warn on signal passing from gdb
kgdb,x86: do not set kgdb_single_step on x86
kgdb: allow for cpu switch when single stepping
kgdb,i386: Fix corner case access to ss with NMI watch dog exception
kgdb: Replace strstr() by strchr() for single-character needles
kgdbts: Read buffer overflow
kgdb: Read buffer overflow
kgdb,x86: remove redundant test
There are two call points, both want to check that tty->signal->leader is
set. Move the test into disassociate_ctty() as that will make locking
changes easier in a bit
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The irqsoff and friends tracers help in finding causes of latency in the
kernel. The also work with the function tracer to show what was happening
when interrupts or preemption are disabled. But the function tracer has
a bit of an overhead and can cause exagerated readings.
Currently, when tracing with /proc/sys/kernel/ftrace_enabled = 0, where the
function tracer is disabled, the information that is provided can end up
being useless. For example, a 2 and a half millisecond latency only showed:
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 2.6.32
# --------------------------------------------------------------------
# latency: 2463 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: -4242 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: _spin_lock_irqsave
# => ended at: remove_wait_queue
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| /_--=> lock-depth
# |||||/ delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
hackbenc-4242 2d.... 0us!: trace_hardirqs_off <-_spin_lock_irqsave
hackbenc-4242 2...1. 2463us+: _spin_unlock_irqrestore <-remove_wait_queue
hackbenc-4242 2...1. 2466us : trace_preempt_on <-remove_wait_queue
The above lets us know that hackbench with pid 2463 grabbed a spin lock
somewhere and enabled preemption at remove_wait_queue. This helps a little
but where this actually happened is not informative.
This patch adds the stack dump to the end of the irqsoff tracer. This provides
the following output:
hackbenc-4242 2d.... 0us!: trace_hardirqs_off <-_spin_lock_irqsave
hackbenc-4242 2...1. 2463us+: _spin_unlock_irqrestore <-remove_wait_queue
hackbenc-4242 2...1. 2466us : trace_preempt_on <-remove_wait_queue
hackbenc-4242 2...1. 2467us : <stack trace>
=> sub_preempt_count
=> _spin_unlock_irqrestore
=> remove_wait_queue
=> free_poll_entry
=> poll_freewait
=> do_sys_poll
=> sys_poll
=> system_call_fastpath
Now we see that the culprit of this latency was the free_poll_entry code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I've been asked a few times about how to find out what is calling
some location in the kernel. One way is to use dynamic function tracing
and implement the func_stack_trace. But this only finds out who is
calling a particular function. It does not tell you who is calling
that function and entering a specific if conditional.
I have myself implemented a quick version of trace_dump_stack() for
this purpose a few times, and just needed it now. This is when I realized
that this would be a good tool to have in the kernel like trace_printk().
Using trace_dump_stack() is similar to dump_stack() except that it
writes to the trace buffer instead and can be used in critical locations.
For example:
@@ -5485,8 +5485,12 @@ need_resched_nonpreemptible:
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
- else
+ else {
deactivate_task(rq, prev, 1);
+ trace_printk("Deactivating task %s:%d\n",
+ prev->comm, prev->pid);
+ trace_dump_stack();
+ }
switch_count = &prev->nvcsw;
}
Produces:
<...>-3249 [001] 296.105269: schedule: Deactivating task ntpd:3249
<...>-3249 [001] 296.105270: <stack trace>
=> schedule
=> schedule_hrtimeout_range
=> poll_schedule_timeout
=> do_select
=> core_sys_select
=> sys_select
=> system_call_fastpath
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch fixes 2 edge cases in using kgdb in conjunction with gdb.
1) kgdb_deactivate_sw_breakpoints() should process the entire array of
breakpoints. The failure to do so results in breakpoints that you
cannot remove, because a break point can only be removed if its
state flag is set to BP_SET.
The easy way to duplicate this problem is to plant a break point in
a kernel module and then unload the kernel module.
2) kgdb_activate_sw_breakpoints() should process the entire array of
breakpoints. The failure to do so results in missed breakpoints
when a breakpoint cannot be activated.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
On some architectures for the segv trap, gdb wants to pass the signal
back on continue. For kgdb this is not the default behavior, because
it can cause the kernel to crash if you arbitrarily pass back a
exception outside of kgdb.
Instead of causing instability, pass a message back to gdb about the
supported kgdb signal passing and execute a standard kgdb continue
operation.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The kgdb core should not assume that a single step operation of a
kernel thread will complete on the same CPU. The single step flag is
set at the "thread" level and it is possible in a multi cpu system
that a kernel thread can get scheduled on another cpu the next time it
is run.
As a further safety net in case a slave cpu is hung, the debug master
cpu will try 100 times before giving up and assuming control of the
slave cpus is no longer possible. It is more useful to be able to get
some information out of kgdb instead of spinning forever.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Roel Kluin reported an error found with Parfait. Where we want to
ensure that that kgdb_info[-1] never gets accessed.
Also check to ensure any negative tid does not exceed the size of the
shadow CPU array, else report critical debug context because it is an
internal kgdb failure.
Reported-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Xiaotian Feng triggered a list corruption in the clock events list on
CPU hotplug and debugged the root cause.
If a CPU registers more than one per cpu clock event device, then only
the active clock event device is removed on CPU_DEAD. The unused
devices are kept in the clock events device list.
On CPU up the clock event devices are registered again, which means
that we list_add an already enqueued list_head. That results in list
corruption.
Resolve this by removing all devices which are associated to the dead
CPU on CPU_DEAD.
Reported-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Xiaotian Feng <dfeng@redhat.com>
Cc: stable@kernel.org
While using an application that does splice on the ftrace ring
buffer at start up, I triggered an integrity check failure.
Looking into this, I discovered that resizing the buffer performs
an integrity check after the buffer is resized. This check unfortunately
is preformed after it releases the reader lock. If a reader is
reading the buffer it may cause the integrity check to trigger a
false failure.
This patch simply moves the integrity checker under the protection
of the ring buffer reader lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There was a comment in the ring buffer code that says the calling
layers should prevent tracing or reading of the ring buffer while
resizing. I have discovered that the tracers do not honor this
arrangement.
This patch moves the disabling and synchronizing the ring buffer to
a higher layer during resizing. This guarantees that no writes
are occurring while the resize takes place.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
strstrip returns a pointer to the first non space character, but the
code in parse_ksym_trace_str() ignores that.
strstrip is now must_check and therefor we get the correct warning:
kernel/trace/trace_ksym.c:294: warning:
ignoring return value of ‘strstrip’, declared with attribute warn_unused_result
We are really not interested in leading whitespace here.
Fix that and cleanup the dozen kfree() exit pathes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
commit c69e8d9 (CRED: Use RCU to access another task's creds and to
release a task's own creds) added non rcu_read_lock() protected access
to task creds of the target task in set_prio_one().
The comment above the function says:
* - the caller must hold the RCU read lock
The calling code in sys_setpriority does read_lock(&tasklist_lock) but
not rcu_read_lock(). This works only when CONFIG_TREE_PREEMPT_RCU=n.
With CONFIG_TREE_PREEMPT_RCU=y the rcu_callbacks can run in the tick
interrupt when they see no read side critical section.
There is another instance of __task_cred() in sys_setpriority() itself
which is equally unprotected.
Wrap the whole code section into a rcu read side critical section to
fix this quick and dirty.
Will be revisited in course of the read_lock(&tasklist_lock) -> rcu
crusade.
Oleg noted further:
This also fixes another bug here. find_task_by_vpid() is not safe
without rcu_read_lock(). I do not mean it is not safe to use the
result, just find_pid_ns() by itself is not safe.
Usually tasklist gives enough protection, but if copy_process() fails
it calls free_pid() lockless and does call_rcu(delayed_put_pid().
This means, without rcu lock find_pid_ns() can't scan the hash table
safely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091210004703.029784964@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
1) Remove the misleading comment in __sigqueue_alloc() which claims
that holding a spinlock is equivalent to rcu_read_lock().
2) Add a rcu_read_lock/unlock around the __task_cred() access
in __sigqueue_alloc()
This needs to be revisited to remove the remaining users of
read_lock(&tasklist_lock) but that's outside the scope of this patch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091210004703.269843657@linutronix.de>
kill_pid_info_as_uid() accesses __task_cred() without being in a RCU
read side critical section. tasklist_lock is not protecting that when
CONFIG_TREE_PREEMPT_RCU=y.
Convert the whole tasklist_lock section to rcu and use
lock_task_sighand to prevent the exit race.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091210004703.232302055@linutronix.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
This build warning:
kernel/sched.c: In function 'set_task_cpu':
kernel/sched.c:2070: warning: unused variable 'old_rq'
Made me realize that the forced2_migrations stat looks pretty
pointless (and a misnomer) - remove it.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If the second in each of these pairs of allocations fails, then the
first one will not be freed in the error route out.
Found by a static code analysis tool.
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1260448177-28448-1-git-send-email-ext-phil.2.carmody@nokia.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no reason to make timer_stats_hrtimer_set_start_info and
friends visible to the rest of the kernel. So move all of them to
hrtimer.c. Also make timer_stats_hrtimer_set_start_info a static
inline function so it gets inlined and we avoid another function call.
Based on a patch by Thomas Gleixner.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <20091210095629.GC4144@osiris.boeblingen.de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The hrtimer_interrupt hang logic adjusts min_delta_ns based on the
execution time of the hrtimer callbacks.
This is error-prone for virtual machines, where a guest vcpu can be
scheduled out during the execution of the callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor), which in can lead to large min_delta_ns rendering the
system unusable.
Replace the current heuristics with something more reliable. Allow the
interrupt code to try 3 times to catch up with the lost time. If that
fails use the total time spent in the interrupt handler to defer the
next timer interrupt so the system can catch up with other things
which got delayed. Limit that deferment to 100ms.
The retry events and the maximum time spent in the interrupt handler
are recorded and exposed via /proc/timer_list
Inspired by a patch from Marcelo.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
ia64 found this the hard way (because we currently have a stub
for save_stack_trace() that does nothing). But it would be a
good idea to be cautious in case a real save_stack_trace()
bailed out with an error before it set trace->nr_entries.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: luming.yu@intel.com
LKML-Reference: <4b2024d085302c2a2@agluck-desktop.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cap_get_target_pid() protects the task lookup with tasklist_lock.
security_capget() is called under tasklist_lock as well but
tasklist_lock does not protect anything there. The capabilities are
protected by RCU already.
So tasklist_lock only protects the lookup and prevents the task going
away, which can be done with rcu_read_lock() as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: James Morris <jmorris@namei.org>
Olof Johansson stated the following:
Comparing a va_list with NULL is bogus. It's supposed to be treated like
an opaque type and only be manipulated with va_* accessors.
Olof noticed that this code broke the ARM builds:
kernel/trace/trace.c: In function 'trace_array_vprintk':
kernel/trace/trace.c:1364: error: invalid operands to binary == (have 'va_list' and 'void *')
kernel/trace/trace.c: In function 'tracing_mark_write':
kernel/trace/trace.c:3349: error: incompatible type for argument 3 of 'trace_vprintk'
This patch partly reverts c13d2f7c32 and
re-installs the original mark_printk() mechanism.
Reported-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
LKML-Reference: <4B1BAB74.104@osadl.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is a case where the graph tracer might get confused and omits
displaying of a single record. This applies mostly with the trace_pipe
since it is unlikely that the trace_seq buffer will overflow with the
trace file.
As the function_graph tracer goes through the trace entries keeping a
pointer to the current record:
current -> func1 ENTRY
func2 ENTRY
func2 RETURN
func1 RETURN
When an function ENTRY is encountered, it moves the pointer to the
next entry to check if the function is a nested or leaf function.
func1 ENTRY
current -> func2 ENTRY
func2 RETURN
func1 RETURN
If the rest of the writing of the function fills the trace_seq buffer,
then the trace_pipe read will ignore this entry. The next read will
Now start at the current location, but the first entry (func1) will
be discarded.
This patch keeps a copy of the current entry in the iterator private
storage and will keep track of when the trace_seq buffer fills. When
the trace_seq buffer fills, it will reuse the copy of the entry in the
next iteration.
[
This patch has been largely modified by Steven Rostedt in order to
clean it up and simplify it. The original idea and concept was from
Jirka and for that, this patch will go under his name to give him
the credit he deserves. But because this was modify by Steven Rostedt
anything wrong with the patch should be blamed on Steven.
]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1259067458-27143-1-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_seq buffer might fill up, and right now one needs to check the
return value of each printf into the buffer to check for that.
Instead, have the buffer keep track of whether it is full or not, and
reject more input if it is full or would have overflowed with an input
that wasn't added.
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the seq_read fills the buffer it will call s_start again on the next
itertation with the same position. This causes a problem with the
function_graph tracer because it consumes the iteration in order to
determine leaf functions.
What happens is that the iterator stores the entry, and the function
graph plugin will look at the next entry. If that next entry is a return
of the same function and task, then the function is a leaf and the
function_graph plugin calls ring_buffer_read which moves the ring buffer
iterator forward (the trace iterator still points to the function start
entry).
The copying of the trace_seq to the seq_file buffer will fail if the
seq_file buffer is full. The seq_read will not show this entry.
The next read by userspace will cause seq_read to again call s_start
which will reuse the trace iterator entry (the function start entry).
But the function return entry was already consumed. The function graph
plugin will think that this entry is a nested function and not a leaf.
To solve this, the trace code now checks the return status of the
seq_printf (trace_print_seq). If the writing to the seq_file buffer
fails, we set a flag in the iterator (leftover) and we do not reset
the trace_seq buffer. On the next call to s_start, we check the leftover
flag, and if it is set, we just reuse the trace_seq buffer and do not
call into the plugin print functions.
Before this patch:
2) | fput() {
2) | __fput() {
2) 0.550 us | inotify_inode_queue_event();
2) | __fsnotify_parent() {
2) 0.540 us | inotify_dentry_parent_queue_event();
After the patch:
2) | fput() {
2) | __fput() {
2) 0.550 us | inotify_inode_queue_event();
2) 0.548 us | __fsnotify_parent();
2) 0.540 us | inotify_dentry_parent_queue_event();
[
Updated the patch to fix a missing return 0 from the trace_print_seq()
stub when CONFIG_TRACING is disabled.
Reported-by: Ingo Molnar <mingo@elte.hu>
]
Reported-by: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This fixes a cut and paste error that had pipe_close get called
if pipe_open was defined (not pipe_close).
Reported-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20091209153204.F4CD.A69D9226@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sys: Remove BKL from sys_reboot
pm_qos: clean up racy global "name" variable
pm_qos: remove BKL
find_task_by_vpid() in compat_sys_get_robust_list() does not require
tasklist_lock. It can be protected with rcu_read_lock as done in
sys_get_robust_list() already.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
When we define the common event fields in kprobe, we invert the error
handling and return immediately in case of success. Then we omit
to define specific kprobes fields (ip and nargs), and specific
kretprobes fields (func, ret_ip, nargs). And we only define them
when we fail to create common fields.
The most visible consequence is that we can't create filter for
k(ret)probes specific fields.
This patch re-invert the success/error handling to fix it.
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1260263815-5167-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The normalized values are also recalculated in case the scaling factor
changes.
This patch updates the internally used scheduler tuning values that are
normalized to one cpu in case a user sets new values via sysfs.
Together with patch 2 of this series this allows to let user configured
values scale (or not) to cpu add/remove events taking place later.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259579808-11357-4-git-send-email-ehrhardt@linux.vnet.ibm.com>
[ v2: fix warning ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As scaling now takes place on all kind of cpu add/remove events a user
that configures values via proc should be able to configure if his set
values are still rescaled or kept whatever happens.
As the comments state that log2 was just a second guess that worked the
interface is not just designed for on/off, but to choose a scaling type.
Currently this allows none, log and linear, but more important it allwos
us to keep the interface even if someone has an even better idea how to
scale the values.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259579808-11357-3-git-send-email-ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Based on Peter Zijlstras patch suggestion this enables recalculation of
the scheduler tunables in response of a change in the number of cpus. It
also adds a max of eight cpus that are considered in that scaling.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259579808-11357-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
83f9ac removed a call to effective_prio() in wake_up_new_task(), which
leads to tasks running at MAX_PRIO.
This is caused by the idle thread being set to MAX_PRIO before forking
off init. O(1) used that to make sure idle was always preempted, CFS
uses check_preempt_curr_idle() for that so we can savely remove this bit
of legacy code.
Reported-by: Mike Galbraith <efault@gmx.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259754383.4003.610.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When setting the weight for a per-cpu task-group, we have to put in a
phantom weight when there is no work on that cpu, otherwise we'll not
service that cpu when new work gets placed there until we again update
the per-cpu weights.
We used to add these phantom weights to the total, so that the idle
per-cpu shares don't get inflated, this however causes the non-idle
parts to get deflated, causing unexpected weight distibutions.
Reverse this, so that the non-idle shares are correct but the idle
shares are inflated.
Reported-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1257934048.23203.76.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As Nick pointed out, and realized by myself when doing:
sched: Fix balance vs hotplug race
the patch:
sched: for_each_domain() vs RCU
is wrong, sched_domains are freed after synchronize_sched(), which
means disabling preemption is enough.
Reported-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WAKEUP_RUNNING was an experiment, not sure why that ever ended up being
merged...
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Streamline the wakeup preemption code a bit, unifying the preempt path
so that they all do the same.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If a RT task is woken up while a non-RT task is running,
check_preempt_wakeup() is called to check whether the new task can
preempt the old task. The function returns quickly without going deeper
because it is apparent that a RT task can always preempt a non-RT task.
In this situation, check_preempt_wakeup() always calls update_curr() to
update vruntime value of the currently running task. However, the
function call is unnecessary and redundant at that moment because (1) a
non-RT task can always be preempted by a RT task regardless of its
vruntime value, and (2) update_curr() will be called shortly when the
context switch between two occurs.
By moving update_curr() in check_preempt_wakeup(), we can avoid
redundant call to update_curr(), slightly reducing the time taken to
wake up RT tasks.
Signed-off-by: Jupyung Lee <jupyung@gmail.com>
[ Place update_curr() right before the wake_preempt_entity() call, which
is the only thing that relies on the updated vruntime ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1258451500-6714-1-git-send-email-jupyung@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently we try to do task placement in wake_up_new_task() after we do
the load-balance pass in sched_fork(). This yields complicated semantics
in that we have to deal with tasks on different RQs and the
set_task_cpu() calls in copy_process() and sched_fork()
Rename ->task_new() to ->task_fork() and call it from sched_fork()
before the balancing, this gives the policy a clear point to place the
task.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since set_task_clock() doesn't rely on rq->clock anymore we can simplyfy
the mess in ttwu().
Optimize things a bit by not fiddling with the IRQ state there.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
set_task_cpu() should be rq invariant and only touch task state, it
currently fails to do so, which opens up a few races, since not all
callers hold both rq->locks.
Remove the relyance on rq->clock, as any site calling set_task_cpu()
should also do a remote clock update, which should ensure the observed
time between these two cpus is monotonic, as per
kernel/sched_clock.c:sched_clock_remote().
Therefore we can simply remove the clock_offset bits and be happy.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since we've had a much saner debugfs interface to this, remove the
sysctl one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
[ v2: build fix ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched_rr_get_param calls
task->sched_class->get_rr_interval(task) without protection
against a concurrent sched_setscheduler() call which modifies
task->sched_class.
Serialize the access with task_rq_lock(task) and hand the rq
pointer into get_rr_interval() as it's needed at least in the
sched_fair implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched_getaffinity() is not protected against a concurrent
modification of the tasks affinity.
Serialize the access with task_rq_lock(task).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091208202026.769251187@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In current code, children task will allocate memory for
'child->perf_event_ctxp' if the parent is counted, we can
do it only if the parent allowed children inherit it.
It can save memory and reduce overhead.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B1F19A8.5040805@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up the code a bit:
- define 'perf_cpu_context' variable with 'static'
- use kzalloc() instead of kmalloc() and memset()
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B1F194D.7080306@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, when ptrace needs to modify a breakpoint, like disabling
it, changing its address, type or len, it calls
modify_user_hw_breakpoint(). This latter will perform the heavy and
racy task of unregistering the old breakpoint and registering a new
one.
This is racy as someone else might steal the reserved breakpoint
slot under us, which is undesired as the breakpoint is only
supposed to be modified, sometimes in the middle of a debugging
workflow. We don't want our slot to be stolen in the middle.
So instead of unregistering/registering the breakpoint, just
disable it while we modify its breakpoint fields and re-enable it
after if necessary.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <1260347148-5519-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-for-linus-ntp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ntp: Provide compability defines (You say MOD_NANO, I say ADJ_NANO)
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: do not execute DEBUG_SHIRQ when irq setup failed
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers, init: Limit the number of per cpu calibration bootup messages
posix-cpu-timers: optimize and document timer_create callback
clockevents: Add missing include to pacify sparse
x86: vmiclock: Fix printk format
x86: Fix printk format due to variable type change
sparc: fix printk for change of variable type
clocksource/events: Fix fallout of generic code changes
nohz: Allow 32-bit machines to sleep for more than 2.15 seconds
nohz: Track last do_timer() cpu
nohz: Prevent clocksource wrapping during idle
nohz: Type cast printk argument
mips: Use generic mult/shift factor calculation for clocks
clocksource: Provide a generic mult/shift factor calculation
clockevents: Use u32 for mult and shift factors
nohz: Introduce arch_needs_cpu
nohz: Reuse ktime in sub-functions of tick_check_idle.
time: Remove xtime_cache
time: Implement logarithmic time accumulation
* 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
cfq-iosched: Do not access cfqq after freeing it
block: include linux/err.h to use ERR_PTR
cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
blkio: Allow CFQ group IO scheduling even when CFQ is a module
blkio: Implement dynamic io controlling policy registration
blkio: Export some symbols from blkio as its user CFQ can be a module
block: Fix io_context leak after failure of clone with CLONE_IO
block: Fix io_context leak after clone with CLONE_IO
cfq-iosched: make nonrot check logic consistent
io controller: quick fix for blk-cgroup and modular CFQ
cfq-iosched: move IO controller declerations to a header file
cfq-iosched: fix compile problem with !CONFIG_CGROUP
blkio: Documentation
blkio: Wait on sync-noidle queue even if rq_noidle = 1
blkio: Implement group_isolation tunable
blkio: Determine async workload length based on total number of queues
blkio: Wait for cfq queue to get backlogged if group is empty
blkio: Propagate cgroup weight updation to cfq groups
blkio: Drop the reference to queue once the task changes cgroup
blkio: Provide some isolation between groups
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM: Add flag for devices capable of generating run-time wake-up events
PM / Runtime: Remove unnecessary braces in __pm_runtime_set_status()
PM / Runtime: Make documentation of runtime_idle() agree with the code
PM / Runtime: Ensure timer_expires is nonzero in pm_schedule_suspend()
PM / Runtime: Use deferred_resume flag in pm_request_resume
PM / Runtime: Export the PM runtime workqueue
PM / Runtime: Fix lockdep warning in __pm_runtime_set_status()
PM / Hibernate: Swap, use KERN_CONT
PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c
PM / Hibernate: Move swap functions to kernel/power/swap.c.
PM / freezer: Don't get over-anxious while waiting
* 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (84 commits)
KVM: VMX: Fix comparison of guest efer with stale host value
KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c
KVM: Drop user return notifier when disabling virtualization on a cpu
KVM: VMX: Disable unrestricted guest when EPT disabled
KVM: x86 emulator: limit instructions to 15 bytes
KVM: s390: Make psw available on all exits, not just a subset
KVM: x86: Add KVM_GET/SET_VCPU_EVENTS
KVM: VMX: Report unexpected simultaneous exceptions as internal errors
KVM: Allow internal errors reported to userspace to carry extra data
KVM: Reorder IOCTLs in main kvm.h
KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG
KVM: only clear irq_source_id if irqchip is present
KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic
KVM: x86: disallow multiple KVM_CREATE_IRQCHIP
KVM: VMX: Remove vmx->msr_offset_efer
KVM: MMU: update invlpg handler comment
KVM: VMX: move CR3/PDPTR update to vmx_set_cr3
KVM: remove duplicated task_switch check
KVM: powerpc: Fix BUILD_BUG_ON condition
KVM: VMX: Use shared msr infrastructure
...
Trivial conflicts due to new Kconfig options in arch/Kconfig and kernel/Makefile
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
mac80211: fix reorder buffer release
iwmc3200wifi: Enable wimax core through module parameter
iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
iwmc3200wifi: Coex table command does not expect a response
iwmc3200wifi: Update wiwi priority table
iwlwifi: driver version track kernel version
iwlwifi: indicate uCode type when fail dump error/event log
iwl3945: remove duplicated event logging code
b43: fix two warnings
ipw2100: fix rebooting hang with driver loaded
cfg80211: indent regulatory messages with spaces
iwmc3200wifi: fix NULL pointer dereference in pmkid update
mac80211: Fix TX status reporting for injected data frames
ath9k: enable 2GHz band only if the device supports it
airo: Fix integer overflow warning
rt2x00: Fix padding bug on L2PAD devices.
WE: Fix set events not propagated
b43legacy: avoid PPC fault during resume
b43: avoid PPC fault during resume
tcp: fix a timewait refcnt race
...
Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
kernel/sysctl_check.c
net/ipv4/sysctl_net_ipv4.c
net/ipv6/addrconf.c
net/sctp/sysctl.c
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
security/tomoyo: Remove now unnecessary handling of security_sysctl.
security/tomoyo: Add a special case to handle accesses through the internal proc mount.
sysctl: Drop & in front of every proc_handler.
sysctl: Remove CTL_NONE and CTL_UNNUMBERED
sysctl: kill dead ctl_handler definitions.
sysctl: Remove the last of the generic binary sysctl support
sysctl net: Remove unused binary sysctl code
sysctl security/tomoyo: Don't look at ctl_name
sysctl arm: Remove binary sysctl support
sysctl x86: Remove dead binary sysctl support
sysctl sh: Remove dead binary sysctl support
sysctl powerpc: Remove dead binary sysctl support
sysctl ia64: Remove dead binary sysctl support
sysctl s390: Remove dead sysctl binary support
sysctl frv: Remove dead binary sysctl support
sysctl mips/lasat: Remove dead binary sysctl support
sysctl drivers: Remove dead binary sysctl support
sysctl crypto: Remove dead binary sysctl support
sysctl security/keys: Remove dead binary sysctl support
sysctl kernel: Remove binary sysctl logic
...
get_user_pages() must be called with mmap_sem held.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linuxfoundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091208121942.GA21298@basil.fritz.box>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Today's linux-next build failed with:
kernel/hw_breakpoint.c:86: error: 'task_bp_pinned' redeclared as different kind of symbol
...
Caused by commit dd17c8f729 ("percpu:
remove per_cpu__ prefix") from the percpu tree interacting with
commit 56053170ea ("hw-breakpoints:
Fix task-bound breakpoint slot allocation") from the tip tree.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20091208182515.bb6dda4a.sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
An ftrace plugin can add a pipe_open interface when the user opens
trace_pipe. But if the plugin allocates something within the pipe_open
it can not free it because there exists no pipe_close. The hook to
the trace file open has a corresponding close. The closing of the
trace_pipe file should also have a corresponding close.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Whatever the context nature of a breakpoint, we always perform the
following constraint checks before allocating it a slot:
- Check the number of pinned breakpoint bound the concerned cpus
- Check the max number of task-bound breakpoints that are belonging
to a task.
- Add both and see if we have a reamining slot for the new breakpoint
This is the right thing to do when we are about to register a cpu-only
bound breakpoint. But not if we are dealing with a task bound
breakpoint. What we want in this case is:
- Check the number of pinned breakpoint bound the concerned cpus
- Check the number of breakpoints that already belong to the task
in which the breakpoint to register is bound to.
- Add both
This fixes a regression that makes the "firefox -g" command fail to
register breakpoints once we deal with a secondary thread.
Reported-by: Walt <w41ter@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
sched domain managment) we have cpu_active_mask which is suppose to rule
scheduler migration and load-balancing, except it never (fully) did.
The particular problem being solved here is a crash in try_to_wake_up()
where select_task_rq() ends up selecting an offline cpu because
select_task_rq_fair() trusts the sched_domain tree to reflect the
current state of affairs, similarly select_task_rq_rt() trusts the
root_domain.
However, the sched_domains are updated from CPU_DEAD, which is after the
cpu is taken offline and after stop_machine is done. Therefore it can
race perfectly well with code assuming the domains are right.
Cure this by building the domains from cpu_active_mask on
CPU_DOWN_PREPARE.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit acc3f5d7ca ("cpumask:
Partition_sched_domains takes array of cpumask_var_t") changed
the function signature of generate_sched_domains() for the
CONFIG_SMP=y case, but forgot to update the corresponding
function for the CONFIG_SMP=n case, causing:
kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <alpine.DEB.2.00.0912062038070.5693@ayla.of.borg>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch (as1306) exports the PM runtime workqueue for use by
loadable modules.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Use KERN_CONT in save_image() for printks, so that anybody won't
try to add a loglevel.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Shift the remaining declaration of the variable in_suspend and the
function swsusp_show_speed from swsusp.c to hibernate.c, and delete
swsusp.c.
Signed-off-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Move hibernation code's functions for allocating and freeing swap
from swsusp.c to swap.c, which is where you'd expect to find them.
Signed-off-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Fix min, max times in /proc/lock_stats
(1) When collecting lock hold and wait times, if the current minimum
time is zero, it will be replaced by the next time.
(2) When aggregating minimum and maximum lock hold and wait times
accross cpus, the values are added, instead of selecting the
minimum and maximum.
Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4B05BBAE.2050005@am.sony.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
struct perf_event::event callback was called when a breakpoint
triggers. But this is a rather opaque callback, pretty
tied-only to the breakpoint API and not really integrated into perf
as it triggers even when we don't overflow.
We prefer to use overflow_handler() as it fits into the perf events
rules, being called only when we overflow.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>
Drop the callback and task parameters from modify_user_hw_breakpoint().
For now we have no user that need to modify a breakpoint to the point
of changing its handler or its task context.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (40 commits)
tracing: Separate raw syscall from syscall tracer
ring-buffer-benchmark: Add parameters to set produce/consumer priorities
tracing, function tracer: Clean up strstrip() usage
ring-buffer benchmark: Run producer/consumer threads at nice +19
tracing: Remove the stale include/trace/power.h
tracing: Only print objcopy version warning once from recordmcount
tracing: Prevent build warning: 'ftrace_graph_buf' defined but not used
ring-buffer: Move access to commit_page up into function used
tracing: do not disable interrupts for trace_clock_local
ring-buffer: Add multiple iterations between benchmark timestamps
kprobes: Sanitize struct kretprobe_instance allocations
tracing: Fix to use __always_unused attribute
compiler: Introduce __always_unused
tracing: Exit with error if a weak function is used in recordmcount.pl
tracing: Move conditional into update_funcs() in recordmcount.pl
tracing: Add regex for weak functions in recordmcount.pl
tracing: Move mcount section search to front of loop in recordmcount.pl
tracing: Fix objcopy revision check in recordmcount.pl
tracing: Check absolute path of input file in recordmcount.pl
tracing: Correct the check for number of arguments in recordmcount.pl
...
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Fix trace_marker output
tracing: Fix event format export
tracing: Fix return value of tracing_stats_read()
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
rcu: Make RCU's CPU-stall detector be default
rcu: Add expedited grace-period support for preemptible RCU
rcu: Enable fourth level of TREE_RCU hierarchy
rcu: Rename "quiet" functions
rcu: Re-arrange code to reduce #ifdef pain
rcu: Eliminate unneeded function wrapping
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
rcu: Eliminate __rcu_pending() false positives
rcu: Further cleanups of use of lastcomp
rcu: Simplify association of forced quiescent states with grace periods
rcu: Accelerate callback processing on CPUs not detecting GP end
rcu: Mark init-time-only rcu_bootup_announce() as __init
rcu: Simplify association of quiescent states with grace periods
rcu: Rename dynticks_completed to completed_fqs
rcu: Enable synchronize_sched_expedited() fastpath
rcu: Remove inline from forward-referenced functions
rcu: Fix note_new_gpnum() uses of ->gpnum
rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter
rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter
rcu: Cleanup: balance rcu_irq_enter()/rcu_irq_exit() calls
...
* 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ratelimit: Make suppressed output messages more useful
printk: Remove ratelimit.h from kernel.h
ratelimit: Fix/allow use in atomic contexts
ratelimit: Use per ratelimit context locking
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
mutex: Fix missing conditions to build mutex_spin_on_owner()
mutex: Better control mutex adaptive spinning config
locking, task_struct: Reduce size on TRACE_IRQFLAGS and 64bit
locking: Use __[SPIN|RW]_LOCK_UNLOCKED in [spin|rw]_lock_init()
locking: Remove unused prototype
locking: Reduce ifdefs in kernel/spinlock.c
locking: Make inlining decision Kconfig based
With CLONE_IO, parent's io_context->nr_tasks is incremented, but never
decremented whenever copy_process() fails afterwards, which prevents
exit_io_context() from calling IO schedulers exit functions.
Give a task_struct to exit_io_context(), and call exit_io_context() instead of
put_io_context() in copy_process() cleanup path.
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
commit 8ec1e0ebe26087bfc5c0394ada5feb5758014fc8
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:16:35 2009 +0100
ipv4: add sysctl to accept packets with local source addresses
Change fib_validate_source() to accept packets with a local source address when
the "accept_local" sysctl is set for the incoming inet device. Combined with the
previous patches, this allows to communicate between multiple local interfaces
over the wire.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't need to build mutex_spin_on_owner() if we have
CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES as
it won't be used under such configs.
Use CONFIG_MUTEX_SPIN_ON_OWNER as it gathers all the necessary
checks before building it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1259783357-8542-2-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Introduce CONFIG_MUTEX_SPIN_ON_OWNER so that we can centralize
in a single place the conditions that determine its definition
and use.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1259783357-8542-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Enable a fourth level of rcu_node hierarchy for TREE_RCU and
TREE_PREEMPT_RCU. This is for stress-testing and experiemental
purposes only, although in theory this would enable 16,777,216
CPUs on 64-bit systems, though only 1,048,576 CPUs on 32-bit
systems. Normal experimental use of this fourth level will
normally set CONFIG_RCU_FANOUT=2, requiring a 16-CPU system,
though the more adventurous (and more fortunate) experimenters
may wish to chose CONFIG_RCU_FANOUT=3 for 81-CPU systems or even
CONFIG_RCU_FANOUT=4 for 256-CPU systems.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846161257-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The number of "quiet" functions has grown recently, and the
names are no longer very descriptive. The point of all of these
functions is to do some portion of the task of reporting a
quiescent state, so rename them accordingly:
o cpu_quiet() becomes rcu_report_qs_rdp(), which reports a
quiescent state to the per-CPU rcu_data structure. If this
turns out to be a new quiescent state for this grace period,
then rcu_report_qs_rnp() will be invoked to propagate the
quiescent state up the rcu_node hierarchy.
o cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports
a quiescent state for a given CPU (or possibly a set of CPUs)
up the rcu_node hierarchy.
o cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which
reports a full set of quiescent states to the global rcu_state
structure.
o task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports
a quiescent state due to a task exiting an RCU read-side critical
section that had previously blocked in that same critical section.
As indicated by the new name, this type of quiescent state is
reported up the rcu_node hierarchy (using rcu_report_qs_rnp()
to do so).
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846163698-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On the parisc architecture we face for each and every loaded kernel module
this kernel "badness warning":
sysfs: cannot create duplicate filename '/module/ac97_bus/sections/.text'
Badness at fs/sysfs/dir.c:487
Reason for that is, that on parisc all kernel modules do have multiple
.text sections due to the usage of the -ffunction-sections compiler flag
which is needed to reach all jump targets on this platform.
An objdump on such a kernel module gives:
Sections:
Idx Name Size VMA LMA File off Algn
0 .note.gnu.build-id 00000024 00000000 00000000 00000034 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .text 00000000 00000000 00000000 00000058 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
2 .text.ac97_bus_match 0000001c 00000000 00000000 00000058 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
3 .text 00000000 00000000 00000000 000000d4 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
...
Since the .text sections are empty (size of 0 bytes) and won't be
loaded by the kernel module loader anyway, I don't see a reason
why such sections need to be listed under
/sys/module/<module_name>/sections/<section_name> either.
The attached patch does solve this issue by not exporting section
names which are empty.
This fixes bugzilla http://bugzilla.kernel.org/show_bug.cgi?id=14703
Signed-off-by: Helge Deller <deller@gmx.de>
CC: rusty@rustcorp.com.au
CC: akpm@linux-foundation.org
CC: James.Bottomley@HansenPartnership.com
CC: roland@redhat.com
CC: dave@hiauly1.hia.nrc.ca
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a real fix for problem of utime/stime values decreasing
described in the thread:
http://lkml.org/lkml/2009/11/3/522
Now cputime is accounted in the following way:
- {u,s}time in task_struct are increased every time when the thread
is interrupted by a tick (timer interrupt).
- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after adjusted by task_times().
- When all threads in a thread_group exits, accumulated {u,s}time
(and also c{u,s}time) in signal struct are added to c{u,s}time
in signal struct of the group's parent.
So {u,s}time in task struct are "raw" tick count, while
{u,s}time and c{u,s}time in signal struct are "adjusted" values.
And accounted values are used by:
- task_times(), to get cputime of a thread:
This function returns adjusted values that originates from raw
{u,s}time and scaled by sum_exec_runtime that accounted by CFS.
- thread_group_cputime(), to get cputime of a thread group:
This function returns sum of all {u,s}time of living threads in
the group, plus {u,s}time in the signal struct that is sum of
adjusted cputimes of all exited threads belonged to the group.
The problem is the return value of thread_group_cputime(),
because it is mixed sum of "raw" value and "adjusted" value:
group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater
than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
but only runs 45ms) and if it exits, cputime will decrease (e.g.
-5ms).
To fix this, we could do:
group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
But task_times() contains hard divisions, so applying it for
every thread should be avoided.
This patch fixes the above problem in the following way:
- Modify thread's exit (= __exit_signal()) not to use task_times().
It means {u,s}time in signal struct accumulates raw values instead
of adjusted values. As the result it makes thread_group_cputime()
to return pure sum of "raw" values.
- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts "raw" values of thread_group_cputime() to "adjusted"
values, in same calculation procedure as task_times().
- Modify group's exit (= wait_task_zombie()) to use this introduced
thread_group_times(). It make c{u,s}time in signal struct to
have adjusted values like before this patch.
- Replace some thread_group_cputime() by thread_group_times().
This replacements are only applied where conveys the "adjusted"
cputime to users, and where already uses task_times() near by it.
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)
This patch have a positive side effect:
- Before this patch, if a group contains many short-life threads
(e.g. runs 0.9ms and not interrupted by ticks), the group's
cputime could be invisible since thread's cputime was accumulated
after adjusted: imagine adjustment function as adj(ticks, runtime),
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch it will not happen because the adjustment is
applied after accumulated.
v2:
- remove if()s, put new variables into signal_struct.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- Remove if({u,s}t)s because no one call it with NULL now.
- Use cputime_{add,sub}().
- Add ifndef-endif for prev_{u,s}time since they are used
only when !VIRT_CPU_ACCOUNTING.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B1624C7.7040302@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Anton Blanchard wrote:
> We allocate and zero cpu_isolated_map after the isolcpus
> __setup option has run. This means cpu_isolated_map always
> ends up empty and if CPUMASK_OFFSTACK is enabled we write to a
> cpumask that hasn't been allocated.
I introduced this regression in 49557e6203 (sched: Fix
boot crash by zalloc()ing most of the cpu masks).
Use the bootmem allocator if they set isolcpus=, otherwise
allocate and zero like normal.
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: peterz@infradead.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <200912021409.17013.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Anton Blanchard <anton@samba.org>
Instead of using per_cpu(..., raw_smp_processor_id()), use
__get_cpu_var(...).
Signed-off-by: Avi Kivity <avi@redhat.com>
LKML-Reference: <1259578491-4589-1-git-send-email-avi@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
498657a478 incorrectly assumed
that preempt wasn't disabled around context_switch() and thus
was fixing imaginary problem. It also broke KVM because it
depended on ->sched_in() to be called with irq enabled so that
it can do smp calls from there.
Revert the incorrect commit and add comment describing different
contexts under with the two callbacks are invoked.
Avi: spotted transposed in/out in the added comment.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: peterz@infradead.org
Cc: efault@gmx.de
Cc: rusty@rustcorp.com.au
LKML-Reference: <1259726212-30259-2-git-send-email-tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kmsg_dump() fails to build when CONFIG_PRINTK=n; provide stubs
for the kmsg_dump*() functions when CONFIG_PRINTK=n.
kernel/printk.c: In function 'kmsg_dump':
kernel/printk.c:1501: error: 'log_buf_len' undeclared (first use in this function)
kernel/printk.c:1502: error: 'logged_chars' undeclared (first use in this function)
kernel/printk.c:1506: error: 'log_buf' undeclared (first use in this function)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
In the CONFIG_PERF_USE_VMALLOC case, perf_mmap_data_free() only
schedules the cleanup of the perf_mmap_data struct. In that
case we have to wait until the work has been done before we free
data.
Signed-off-by: Kristian Høgsberg <krh@bitplanet.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <1259697901-1747-1-git-send-email-krh@bitplanet.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After duplications are removed, syscall_name_to_nr() is unused.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D2A6.6060803@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use only one prof_sysenter_enable() instead of
prof_sysenter_enable_##sname()
use only one prof_sysenter_disable() instead of
prof_sysenter_disable_##sname()
use only one prof_sysexit_enable() instead of
prof_sysexit_enable_##sname()
use only one prof_sysexit_disable() instead of
prof_sysexit_disable_##sname()
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D2A1.8060304@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use only one init_syscall_trace instead of
many init_enter_##sname()/init_exit_##sname()
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D29B.6090708@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add syscall_nr field to struct syscall_metadata,
it helps us to get syscall number easier.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D293.6090800@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use ->enter_event->id instead of ->enter_id
use ->exit_event->id instead of ->exit_id
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D288.7030001@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Set event_enter_##sname->data to its metadata,
it makes codes simpler.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D282.7050709@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commits 3d7a641 ("SLOW_WORK: Wait for outstanding work items belonging to a
module to clear") introduced some code to make sure that all of a module's
slow-work items were complete before that module was removed, and commit
3bde31a ("SLOW_WORK: Allow a requeueable work item to sleep till the thread is
needed") further extended that, breaking it in the process if CONFIG_MODULES=n:
CC kernel/slow-work.o
kernel/slow-work.c: In function 'slow_work_execute':
kernel/slow-work.c:313: error: 'slow_work_thread_processing' undeclared (first use in this function)
kernel/slow-work.c:313: error: (Each undeclared identifier is reported only once
kernel/slow-work.c:313: error: for each function it appears in.)
kernel/slow-work.c: In function 'slow_work_wait_for_items':
kernel/slow-work.c:950: error: 'slow_work_unreg_sync_lock' undeclared (first use in this function)
kernel/slow-work.c:951: error: 'slow_work_unreg_wq' undeclared (first use in this function)
kernel/slow-work.c:961: error: 'slow_work_unreg_work_item' undeclared (first use in this function)
kernel/slow-work.c:974: error: 'slow_work_unreg_module' undeclared (first use in this function)
kernel/slow-work.c:977: error: 'slow_work_thread_processing' undeclared (first use in this function)
make[1]: *** [kernel/slow-work.o] Error 1
Fix this by:
(1) Extracting the bits of slow_work_execute() that are contingent on
CONFIG_MODULES, and the bits that should be, into inline functions and
placing them into the #ifdef'd section that defines the relevant variables
and adding stubs for moduleless kernels. This allows the removal of some
#ifdefs.
(2) #ifdef'ing out the contents of slow_work_wait_for_items() in moduleless
kernels.
The four functions related to handling module unloading synchronisation (and
their associated variables) could be offloaded into a separate .c file, but
each function is only used once and three of them are tiny, so doing so would
prevent them from being inlined.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a memory leak case in create_trace_probe(). When an argument
is too long (> MAX_ARGSTR_LEN), it just jumps to error path. In
that case tp->args[i].name is not released.
This also fixes a bug to check kstrdup()'s return value.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091201001919.10235.56455.stgit@harusame>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The core functionality is implemented as per Linus suggestion from
http://lists.infradead.org/pipermail/linux-mtd/2009-October/027620.html
(with the kmsg_dump implementation by Linus). A struct kmsg_dumper has
been added which contains a callback to dump the kernel log buffers on
crashes. The kmsg_dump function gets called from oops_exit() and panic()
and invokes this callbacks with the crash reason.
[dwmw2: Fix log_end handling]
Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Reviewed-by: Anders Grafstrom <anders.grafstrom@netinsight.net>
Reviewed-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
fork() clones all thread_info flags, including
TIF_USER_RETURN_NOTIFY; if the new task is first scheduled on a cpu
which doesn't have user return notifiers set, this causes user
return notifiers to trigger without any way of clearing itself.
This is easy to trigger with a forky workload on the host in
parallel with kvm, resulting in a cpu in an endless loop on the
verge of returning to userspace.
Fix by dropping the TIF_USER_RETURN_NOTIFY immediately after fork.
Signed-off-by: Avi Kivity <avi@redhat.com>
LKML-Reference: <1259505288-16559-1-git-send-email-avi@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
"symbol_name+0" is not so friendly.
It makes the output longer.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B0CEBCB.7080309@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Sometimes the group name is not "kprobes",
It'll be better if we can read it from tracing/kprobe_events.
# echo 'r:laijs/vfs_read vfs_read %ax' > kprobe_events
# cat kprobe_events
r:laijs/vfs_read vfs_read %ax=%ax
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B0CEBAF.6000104@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tp->nr_args is not set before we "goto error",
it causes memory leak for free_trace_probe() use tp->nr_args
to free memory of args.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B0CEB95.2060107@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Field syscall number is missed in syscall_enter_define_fields()/
syscall_exit_define_fields().
Syscall number is also needed for event filter or other users.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B0E330D.1070206@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kernel breakpoints are created using functions in which we pass
breakpoint parameters as individual variables: address, length
and type.
Although it fits well for x86, this just does not scale across
architectures that may support this api later as these may have
more or different needs. Pass in a perf_event_attr structure
instead because it is meant to evolve as much as possible into
a generic hardware breakpoint parameter structure.
Reported-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1259294154-5197-2-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In-kernel user breakpoints are created using functions in which
we pass breakpoint parameters as individual variables: address,
length and type.
Although it fits well for x86, this just does not scale across
archictectures that may support this api later as these may have
more or different needs. Pass in a perf_event_attr structure
instead because it is meant to evolve as much as possible into
a generic hardware breakpoint parameter structure.
Reported-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1259294154-5197-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I'm seeing spikes of up to 0.5ms in khungtaskd on a large
machine. To reduce this source of jitter I tried setting
hung_task_check_count to 0:
# echo 0 > /proc/sys/kernel/hung_task_check_count
which didn't have the intended response. Change to a post
increment of max_count, so a value of 0 means check 0 tasks.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: msb@google.com
LKML-Reference: <20091127022820.GU32182@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a pinned group cannot be scheduled it goes into error state.
Normally a group cannot go out of error state without being
explicitly re-enabled or disabled. There was a bug in per-thread
mode, whereby upon termination of the thread, the group would
transition from error to off leading to bogus counts and timing
information returned by read().
Fix it by clearing the error state.
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: perfmon2-devel@lists.sourceforge.net
LKML-Reference: <4b0eb9ce.0508d00a.573b.ffffeab6@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use of msecs_to_jiffies() for nsecs_to_cputime() have some
problems:
- The type of msecs_to_jiffies()'s argument is unsigned int, so
it cannot convert msecs greater than UINT_MAX = about 49.7 days.
- msecs_to_jiffies() returns MAX_JIFFY_OFFSET if MSB of argument
is set, assuming that input was negative value. So it cannot
convert msecs greater than INT_MAX = about 24.8 days too.
This patch defines a new function nsecs_to_jiffies() that can
deal greater values, and that can deal all incoming values as
unsigned.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Amrico Wang <xiyou.wangcong@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
LKML-Reference: <4B0E16E7.5070307@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now all task_{u,s}time() pairs are replaced by task_times().
And task_gtime() is too simple to be an inline function.
Cleanup them all.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
LKML-Reference: <4B0E16D1.70902@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Functions task_{u,s}time() are called in pair in almost all
cases. However task_stime() is implemented to call task_utime()
from its inside, so such paired calls run task_utime() twice.
It means we do heavy divisions (div_u64 + do_div) twice to get
utime and stime which can be obtained at same time by one set
of divisions.
This patch introduces a function task_times(*tsk, *utime,
*stime) to retrieve utime and stime at once in better, optimized
way.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
LKML-Reference: <4B0E16AE.906@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
bp_perf_event_destroy() is unused in its off-case version, let's
remove it to fix the following warning reported by Stephen
Rothwell in linux-next:
kernel/perf_event.c:4306: warning: 'bp_perf_event_destroy' defined but not used
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1259180453-5813-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Limit the number of per cpu calibration messages by only
printing out results for the first cpu to boot.
Also, don't print "CPUx is down" as this is expected, and we
don't need 4096 reminders... ;-)
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091118002219.889552000@alcatraz.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If the new percpu tree is combined with the perf events tree
the following new warning triggers:
kernel/hw_breakpoint.c: In function 'toggle_bp_task_slot':
kernel/hw_breakpoint.c:151: warning: 'task_bp_pinned' is used uninitialized in this function
Because it's not valid anymore to define a local variable
and a percpu variable (even if it's file scope local) with
the same name.
Rename the local variable to resolve this.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <200911260701.nAQ71owx016356@imap1.linux-foundation.org>
[ v2: added changelog ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This simplifies the error handling when we create a breakpoint.
We don't need to check the NULL return value corner case anymore
since we have improved perf_event_create_kernel_counter() to
always return an error code in the failure case.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <1259210142-5714-3-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In fail case, perf_event_create_kernel_counter() returns NULL
instead of an error, which doesn't help us to inform the user
about the origin of the problem from the outer most callers.
Often we can just return -EINVAL, which doesn't help anyone when
it's eventually about a memory allocation failure.
Then, this patch makes perf_event_create_kernel_counter() always
return a detailed error code.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <1259210142-5714-2-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The error path of a breakpoint modification is broken in
the ksym tracer. A modified breakpoint hlist node is immediately
released after its removal. Also we leak a breakpoint in this
case.
Fix the path.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <1259210142-5714-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Running the ring-buffer-benchmark's threads at the lowest priority may
work well for keeping it in the background, but it is not appropriate
for the benchmarks.
This patch adds 4 parameters to the module:
consumer_fifo
consumer_nice
producer_fifo
producer_nice
By default the consumer and producer still run at nice +19.
If the *_fifo options are set, they will override the *_nice values.
modprobe ring_buffer_benchmark consumer_nice=0 producer_fifo=10
The above will set the consumer thread to a nice value of 0, and
the producer thread to a RT SCHED_FIFO priority of 10.
Note, this patch also fixes a bug where calling set_user_nice on the
consumer thread would oops the kernel when the parameter "disable_reader"
is set.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In commit v2.6.21-691-g39bc89f ("make SysRq-T show all tasks
again") the interface of show_state_filter() was changed: zero
valued 'state_filter' specifies "dump all tasks" (instead of -1).
However, the condition for calling debug_show_all_locks() ("show
locks if all tasks are dumped") was not updated accordingly.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: peterz@infradead.org
LKML-Reference: <4b0d2fe4.0ab6660a.6437.3cfc@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit ee949a86b3 ("tracing/syscalls:
Use long for syscall ret format and field definitions") changed the
syscall exit return type to long, but forgot to change it in the
struct.
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1259133299-23594-3-git-send-email-tzanussi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 4ed7c92d68
(perf_events: Undo some recursion damage) has introduced a bad
reference counting of the recursion context. putting the context
behaves like getting it, dropping every software/trace events
after the first one in a context.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1259091502-5171-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When using an event group, the value and id for non leaders events
were wrong due to invalid offset into the outgoing buffer.
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: paulus@samba.org
Cc: perfmon2-devel@lists.sourceforge.net
LKML-Reference: <4b0b71e1.0508d00a.075e.ffff84a3@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As far as I know, all distros currently ship kernels with default
CONFIG_SECURITY_FILE_CAPABILITIES=y. Since having the option on
leaves a 'no_file_caps' option to boot without file capabilities,
the main reason to keep the option is that turning it off saves
you (on my s390x partition) 5k. In particular, vmlinux sizes
came to:
without patch fscaps=n: 53598392
without patch fscaps=y: 53603406
with this patch applied: 53603342
with the security-next tree.
Against this we must weigh the fact that there is no simple way for
userspace to figure out whether file capabilities are supported,
while things like per-process securebits, capability bounding
sets, and adding bits to pI if CAP_SETPCAP is in pE are not supported
with SECURITY_FILE_CAPABILITIES=n, leaving a bit of a problem for
applications wanting to know whether they can use them and/or why
something failed.
It also adds another subtly different set of semantics which we must
maintain at the risk of severe security regressions.
So this patch removes the SECURITY_FILE_CAPABILITIES compile
option. It drops the kernel size by about 50k over the stock
SECURITY_FILE_CAPABILITIES=y kernel, by removing the
cap_limit_ptraced_target() function.
Changelog:
Nov 20: remove cap_limit_ptraced_target() as it's logic
was ifndef'ed.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Andrew G. Morgan" <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
When libcap, or other libraries attempt to confirm/determine the supported
capability version magic, they generally supply a NULL dataptr to capget().
In this case, while returning the supported/preferred magic (via a
modified header content), the return code of this system call may be 0,
-EINVAL, or -EFAULT.
No libcap code depends on the previous -EINVAL etc. return code, and
all of the above three return codes can accompany a valid (successful)
attempt to determine the requested magic value.
This patch cleans up the system call to return 0, if the call is
successfully being used to determine the supported/preferred capability
magic value.
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Add the remaining necessary bits to support breakpoints created
through perf syscall.
We don't use the software counter interface as:
- We don't need to check against recursion, this is already done
in hardware breakpoints arch level.
- We already know the perf event we are dealing with when the
event is to be committed.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <1258987355-8751-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Perf tools create perf events as disabled in the beginning.
Breakpoints are then considered like ptrace temporary
breakpoints, only meant to reserve a breakpoint slot until we
get all the necessary informations from the user.
In this case, we don't check the address that is breakpointed as
it is NULL in the ptrace case.
But perf tools don't have the same purpose, events are created
disabled to wait for all events to be created before enabling
all of them. We want to check the breakpoint parameters in this
case.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <1258987355-8751-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Attribute authorship to developers of hw-breakpoint related
files.
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091123154713.GA5593@in.ibm.com>
[ v2: moved it to latest -tip ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It is quite possible to call update_event_times() on a context
that isn't actually running and thereby confuse the thing.
perf stat was reporting !100% scale values for software counters
(2e2af50b perf_events: Disable events when we detach them,
solved the worst of that, but there was still some left).
The thing that happens is that because we are not self-reaping
(we have a caring parent) there is a time between the last
schedule (out) and having do_exit() called which will detach the
events.
This period would be accounted as enabled,!running because the
event->state==INACTIVE, even though !event->ctx->is_active.
Similar issues could have been observed by calling read() on a
event while the attached task was not scheduled in.
Solve this by teaching update_event_times() about
ctx->is_active.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1258984836.4531.480.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make perf_swevent_get_recursion_context return a context number
and disable preemption.
This could be used to remove the IRQ disable from the trace bit
and index the per-cpu buffer with.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091123103819.993226816@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the update_event_times() call in __perf_event_exit_task()
into list_del_event() because that holds the proper lock
(ctx->lock) and seems a more natural place to do the last time
update.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091123103819.842455480@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It appeared we did call update_event_times() on exit, but we
failed to update the context time, which renders the former
moot.
Locking is a bit iffy, we call update_event_times under
ctx->mutex instead of ctx->lock - the next patch fixes this.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091123103819.764207355@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If we leave the event in STATE_INACTIVE, any read of the event
after the detach will increase the running count but not the
enabled count and cause funny scaling artefacts.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091123103819.689055515@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We had two almost identical functions, avoid the duplication.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20091123103819.537537928@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up strstrip() usage - which also addresses this build warning:
kernel/trace/ftrace.c: In function 'ftrace_pid_write':
kernel/trace/ftrace.c:3004: warning: ignoring return value of 'strstrip', declared with attribute warn_unused_result
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Decreases perf overhead when function tracing is enabled,
by about 50%.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The ring-buffer benchmark threads run on nice 0 by default, using
up a lot of CPU time and slowing down the system:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1024 root 20 0 0 0 0 D 95.3 0.0 4:01.67 rb_producer
1023 root 20 0 0 0 0 R 93.5 0.0 2:54.33 rb_consumer
21569 mingo 40 0 14852 1048 772 R 3.6 0.1 0:00.05 top
1 root 40 0 4080 928 668 S 0.0 0.0 0:23.98 init
Renice them to +19 to make them less intrusive.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure pass through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs go offline. (The events in step
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->block_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The buffer is first zeroed out by memset(). Then strncpy() is
used to fill the content. The strncpy() function also pads the
string till the end of the specified length, which is redundant.
The strncpy() does not ensures that the string will be properly
closed with 0. Use strlcpy() instead.
The semantic match that finds this kind of pattern is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression buffer;
expression size;
expression str;
@@
memset(buffer, 0, size);
...
- strncpy(
+ strlcpy(
buffer, str, sizeof(buffer)
);
@@
expression buffer;
expression size;
expression str;
@@
memset(&buffer, 0, size);
...
- strncpy(
+ strlcpy(
&buffer, str, sizeof(buffer));
@@
expression buffer;
identifier field;
expression size;
expression str;
@@
memset(buffer, 0, size);
...
- strncpy(
+ strlcpy(
buffer->field, str, sizeof(buffer->field)
);
@@
expression buffer;
identifier field;
expression size;
expression str;
@@
memset(&buffer, 0, size);
...
- strncpy(
+ strlcpy(
buffer.field, str, sizeof(buffer.field));
// </smpl>
On strncpy() vs strlcpy() see
http://www.gratisoft.us/todd/papers/strlcpy.html .
Signed-off-by: Márton Németh <nm127@freemail.hu>
Cc: Julia Lawall <julia@diku.dk>
Cc: cocci@diku.dk
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B086547.5040100@freemail.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove asm/processor.h and asm/debugreg.h as these headers are
not used anymore in the hw-breakpoints core file.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <1258863695-10464-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We are never in an NMI context when we commit a syscall trace to
perf. So just forget about the nmi buffer there.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <1258863695-10464-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we commit a trace to perf, we first check if we are
recursing in the same buffer so that we don't mess-up the buffer
with a recursing trace. But later on, we do the same check from
perf to avoid commit recursion. The recursion check is desired
early before we touch the buffer but we want to do this check
only once.
Then export the recursion protection from perf and use it from
the trace events before submitting a trace.
v2: Put appropriate Reported-by tag
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <1258864015-10579-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes the default watermark value for the sampling
buffer. With the existing calculation (watermark =
max(PAGE_SIZE, max_size / 2)), no notification was ever received
when the buffer was exactly 1 page. This was because you would
never cross the threshold (there is no partial samples).
In certain configuration, there was no possibilty detecting the
problem because there was not enough space left to store the
LOST record.In fact, there may be a more generic problem here.
The kernel should ensure that there is alaways enough space to
store one LOST record.
This patch sets the default watermark to half the buffer size.
With such limit, we are guaranteed to get a notification even
with a single page buffer assuming no sample is bigger than a
page.
Signed-off-by: Stephane Eranian <eranian@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212509.344964101@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1256302576-6169-1-git-send-email-eranian@gmail.com>
We should hold event->child_mutex when iterating the inherited
counters, we should hold ctx->mutex when iterating siblings.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212509.251030114@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Properly account the full hierarchy of counters for both the
count (we already did so) and the scale times (new).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212509.153379276@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Most sites updating ctx->time and event times do so under
ctx->lock, make sure they all do.
This was made possible by removing the __perf_event_read() call
from __perf_event_sync_stat(), which already had this lock
taken.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212509.102316434@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cpuctx is always active, task context is always active for
current
the previous condition verifies that if its a task context its
for current, hence we can assume ctx->is_active.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212509.000272254@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Removes constraints from __perf_event_read() by leaving it with
a single callsite; this callsite had ctx->lock held, the other
one does not.
Removes some superfluous code from __perf_event_sync_stat().
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.918544317@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Both callers actually have IRQs disabled, no need doing so
again.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.863685796@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove an update_context_time() call from the
perf_event_task_sched_out() path and into the branch its needed.
The call was both superfluous, because __perf_event_sched_out()
already does it, and wrong, because it was done without holding
ctx->lock.
Place it in perf_event_sync_stat(), which is the only place it
is needed and which does already hold ctx->lock.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.779516394@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As Corey reported, the total_enabled and total_running times
could occasionally be 0, even though there were events counted.
It turns out this is because we record the times before reading
the counter while the latter updates the times.
This patch corrects that.
While looking at this code I found that there is a lot of
locking iffyness around, the following patches correct most of
that.
Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.685559857@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove a rcu_read_{,un}lock() pair and a few conditionals.
We can remove the rcu_read_lock() by increasing the scope of one
in the calling function.
We can do away with the system_state check if the machine still
boots after this patch (seems to be the case).
We can do away with the list_empty() check because the bare
list_for_each_entry_rcu() reduces to that now that we've removed
everything else.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.606459548@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove a rcu_read_{,un}lock() pair and a few conditionals.
We can remove the rcu_read_lock() by increasing the scope of one
in the calling function.
We can do away with the system_state check if the machine still
boots after this patch (seems to be the case).
We can do away with the list_empty() check because the bare
list_for_each_entry_rcu() reduces to that now that we've removed
everything else.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.527608793@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove a rcu_read_{,un}lock() pair and a few conditionals.
We can remove the rcu_read_lock() by increasing the scope of one
in the calling function.
We can do away with the system_state check if the machine still
boots after this patch (seems to be the case).
We can do away with the list_empty() check because the bare
list_for_each_entry_rcu() reduces to that now that we've removed
everything else.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.452227115@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove a rcu_read_{,un}lock() pair and a few conditionals.
We can remove the rcu_read_lock() by increasing the scope of one
in the calling function.
We can do away with the system_state check if the machine still
boots after this patch (seems to be the case).
We can do away with the list_empty() check because the bare
list_for_each_entry_rcu() reduces to that now that we've removed
everything else.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.378188589@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Avoid the rather expensive perf_swevent_set_period() if we know
we have to sample every single event anyway.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.299508332@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
in-kernel perf users might wish to have custom actions on the
sample interrupt.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091120212508.222339539@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/x86/kernel/kprobes.c
kernel/trace/Makefile
Merge reason: hw-breakpoints perf integration is looking
good in testing and in reviews, plus conflicts
are mounting up - so merge & resolve.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
single_open data argument must be PDE(inode)->data instead of NULL
otherwise seq_file->private is always NULL and we always read the
spurious data of irq 0.
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a function to allow a requeueable work item to sleep till the thread
processing it is needed by the slow-work facility to perform other work.
Sometimes a work item can't progress immediately, but must wait for the
completion of another work item that's currently being processed by another
slow-work thread.
In some circumstances, the waiting item could instead - theoretically - put
itself back on the queue and yield its thread back to the slow-work facility,
thus waiting till it gets processing time again before attempting to progress.
This would allow other work items processing time on that thread.
However, this only works if there is something on the queue for it to queue
behind - otherwise it will just get a thread again immediately, and will end
up cycling between the queue and the thread, eating up valuable CPU time.
So, slow_work_sleep_till_thread_needed() is provided such that an item can put
itself on a wait queue that will wake it up when the event it is actually
interested in occurs, then call this function in lieu of calling schedule().
This function will then sleep until either the item's event occurs or another
work item appears on the queue. If another work item is queued, but the
item's event hasn't occurred, then the work item should requeue itself and
yield the thread back to the slow-work facility by returning.
This can be used by CacheFiles for an object that is being created on one
thread to wait for an object being deleted on another thread where there is
nothing on the queue for the creation to go and wait behind. As soon as an
item appears on the queue that could be given thread time instead, CacheFiles
can stick the creating object back on the queue and return to the slow-work
facility - assuming the object deletion didn't also complete.
Signed-off-by: David Howells <dhowells@redhat.com>
This adds support for starting slow work with a delay, similar
to the functionality we have for workqueues.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Add support for cancellation of queued slow work and delayed slow work items.
The cancellation functions will wait for items that are pending or undergoing
execution to be discarded by the slow work facility.
Attempting to enqueue work that is in the process of being cancelled will
result in ECANCELED.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Make the ability for the slow-work facility to take references on a work item
optional as not everyone requires this.
Even the internal slow-work stubs them out, so those can be got rid of too.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Wait for outstanding slow work items belonging to a module to clear when
unregistering that module as a user of the facility. This prevents the put_ref
code of a work item from being taken away before it returns.
Signed-off-by: David Howells <dhowells@redhat.com>
For consistency drop & in front of every proc_handler. Explicity
taking the address is unnecessary and it prevents optimizations
like stubbing the proc_handlers to NULL.
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
incr_error and error fields of struct cpu_itimer are used when calculating
next timer tick in check_cpu_itimers() and should not be modified without
tsk->sighand->siglock taken.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <1253802903-979-1-git-send-email-sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andrew points out that acpi-cpufreq uses cpumask_any, when it really
would prefer to use the same CPU if possible (to avoid an IPI). In
general, this seems a good idea to offer.
[ tglx: Documented selection preference and Inlined the UP case to
avoid the copy of smp_call_function_single() and the extra
EXPORT ]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ tglx: compacted it a bit ]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
LKML-Reference: <20090828181743.GA14050@x200.localdomain>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We have already new_timer initialized to all-zeros hence in function
initializations are not needed. Document function expectation about
new_timer argument as well.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: johnstul@us.ibm.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Include "tick-internal.h" in order to pick up the extern function
prototype for clockevents_shutdown(). This quiets the following sparse
build noise:
warning: symbol 'clockevents_shutdown' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE190901E24550@mi8nycmail19.Mi8.com>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: johnstul@us.ibm.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit 65a6446434 ("HWPOISON: Allow
schedule_on_each_cpu() from keventd") which allows schedule_on_each_cpu()
to be called from keventd added a race condition. schedule_on_each_cpu()
may race with cpu hotplug and end up executing the function twice on a
cpu.
Fix it by moving direct execution into the section protected with
get/put_online_cpus(). While at it, update code such that direct
execution is done after works have been scheduled for all other cpus and
drop unnecessary cpu != orig test from flush loop.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prevent build warning when CONFIG_FUNCTION_GRAPH_TRACER is not set.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4AF24381.5060307@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a string was written to <debugfs>/tracing/trace_marker, some
strange characters appeared in the trace output instead of the
string, since a vprint function erroneously called a vararg print
function with a va_list argument. This patch fixes the problem and
simplifies the related code.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
LKML-Reference: <4B01AE5D.1010801@osadl.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the change of the way we process commits. Where a commit only happens
at the outer most level, and that we don't need to worry about
a commit ending after the rb_start_commit() has been called, the code
use to grab the commit page before the tail page to prevent a possible
race. But this race no longer exists with the rb_start_commit()
rb_end_commit() interface.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since commit 0a544198 "timekeeping: Move NTP adjusted clock multiplier
to struct timekeeper" the clock multiplier of vsyscall is updated with
the unmodified clock multiplier of the clock source and not with the
NTP adjusted multiplier of the timekeeper.
This causes user space observerable time warps:
new CLOCK-warp maximum: 120 nsecs, 00000025c337c537 -> 00000025c337c4bf
Add a new argument "mult" to update_vsyscall() and hand in the
timekeeping internal NTP adjusted multiplier.
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Cc: "Zhang Yanmin" <yanmin_zhang@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
LKML-Reference: <1258436990.17765.83.camel@minggr.sh.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler
gets a small bug fix and the sysctl tree where I am removing all
sysctl strategy routines.
The purpose of perf_output_{un,}lock() is to:
1) avoid publishing incomplete data
[ possible when publishing a head that is ahead of an entry
that is still being written ]
2) guarantee fwd progress
[ a simple refcount on pending writers doesn't need to drop to
0, making it so would end up implementing something like forced
quiecent states of RCU ]
To satisfy the above without undue complexity it serializes
between CPUs, this means that a pending writer can only be the
same cpu in a nested context, and since (under normal operation)
a cpu always makes progress we're good -- if the head is only
published when the bottom most writer completes.
Now we don't need to disable IRQs in order to serialize between
CPUs, disabling preemption ought to be sufficient, esp since we
already deal with nesting due to NMIs.
This avoids potentially expensive (and needless) local IRQ
disable/enable ops.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1258373161.26714.254.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Heiko reported a case where a timer interrupt managed to
reference a root_domain structure that was already freed by a
concurrent hot-un-plug operation.
Solve this like the regular sched_domain stuff is also
synchronized, by adding a synchronize_sched() stmt to the free
path, this ensures that a root_domain stays present for any
atomic section that could have observed it.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Siddha Suresh B <suresh.b.siddha@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <1258363873.26714.83.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add debugobject support to track the life time of work_structs.
While at it, remove duplicate definition of
INIT_DELAYED_WORK_ON_STACK().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
In finish_task_switch(), fire_sched_in_preempt_notifiers() is
called after finish_lock_switch().
However, depending on architecture, preemption can be enabled after
finish_lock_switch() which breaks the semantics of preempt
notifiers.
So move it before finish_arch_switch(). This also makes the in-
notifiers symmetric to out- notifiers in terms of locking - now
both are called under rq lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4AFD2801.7020900@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that there are both ->gpnum and ->completed fields in the
rcu_node structure, __rcu_pending() should check rdp->gpnum and
rdp->completed against rnp->gpnum and rdp->completed, respectively,
instead of the prior comparison against the rcu_state fields
rsp->gpnum and rsp->completed.
Given the old comparison, __rcu_pending() could return 1, resulting
in a needless raise_softirq(RCU_SOFTIRQ). This useless work would
happen if RCU responded to a scheduling-clock interrupt after the
rcu_state fields had been updated, but before the rcu_node fields
had been updated.
Changing the comparison from the rcu_state fields to the rcu_node
fields prevents this useless work from happening.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12581706991966-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
powerpc grew a new warning due to the type change of clockevent->mult.
The architectures which use parts of the generic time keeping
infrastructure tripped over my wrong assumption that
clocksource_register is only used when GENERIC_TIME=y.
I should have looked and also I should have known better. These
renitent Gaul villages are racking my nerves. Some serious deprecating
is due.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the Kconfig based inline decisions we can remove extra ifdefs in
kernel/spinlock.c by creating the complex lockbreak functions as
inlines which are inserted into the non inlined lock functions.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091109151428.548614772@linutronix.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
commit 892a7c67 (locking: Allow arch-inlined spinlocks) implements the
selection of which lock functions are inlined based on defines in
arch/.../spinlock.h: #define __always_inline__LOCK_FUNCTION
Despite of the name __always_inline__* the lock functions can be built
out of line depending on config options. Also if the arch does not set
some inline defines the generic code might set them; again depending on
config options.
This makes it unnecessary hard to figure out when and which lock
functions are inlined. Aside of that it makes it way harder and
messier for -rt to manipulate the lock functions.
Convert the inlining decision to CONFIG switches. Each lock function
is inlined depending on CONFIG_INLINE_*. The configs implement the
existing dependencies. The architecture code can select ARCH_INLINE_*
to signal that it wants the corresponding lock function inlined.
ARCH_INLINE_* is necessary as Kconfig ignores "depends on"
restrictions when a config element is selected.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091109151428.504477141@linutronix.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
In the dynamic tick code, "max_delta_ns" (member of the
"clock_event_device" structure) represents the maximum sleep time
that can occur between timer events in nanoseconds.
The variable, "max_delta_ns", is defined as an unsigned long
which is a 32-bit integer for 32-bit machines and a 64-bit
integer for 64-bit machines (if -m64 option is used for gcc).
The value of max_delta_ns is set by calling the function
"clockevent_delta2ns()" which returns a maximum value of LONG_MAX.
For a 32-bit machine LONG_MAX is equal to 0x7fffffff and in
nanoseconds this equates to ~2.15 seconds. Hence, the maximum
sleep time for a 32-bit machine is ~2.15 seconds, where as for
a 64-bit machine it will be many years.
This patch changes the type of max_delta_ns to be "u64" instead of
"unsigned long" so that this variable is a 64-bit type for both 32-bit
and 64-bit machines. It also changes the maximum value returned by
clockevent_delta2ns() to KTIME_MAX. Hence this allows a 32-bit
machine to sleep for longer than ~2.15 seconds. Please note that this
patch also changes "min_delta_ns" to be "u64" too and although this is
unnecessary, it makes the patch simpler as it avoids to fixup all
callers of clockevent_delta2ns().
[ tglx: changed "unsigned long long" to u64 as we use this data type
through out the time code ]
Signed-off-by: Jon Hunter <jon-hunter@ti.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1250617512-23567-3-git-send-email-jon-hunter@ti.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The previous patch which limits the sleep time to the maximum
deferment time of the time keeping clocksource has some limitations on
SMP machines: if all CPUs are idle then for all CPUs the maximum sleep
time is limited.
Solve this by keeping track of which cpu had the do_timer() duty
assigned last and limit the sleep time only for this cpu.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: Jon Hunter <jon-hunter@ti.com>
Cc: John Stultz <johnstul@us.ibm.com>
The dynamic tick allows the kernel to sleep for periods longer than a
single tick, but it does not limit the sleep time currently. In the
worst case the kernel could sleep longer than the wrap around time of
the time keeping clock source which would result in losing track of
time.
Prevent this by limiting it to the safe maximum sleep time of the
current time keeping clock source. The value is calculated when the
clock source is registered.
[ tglx: simplified the code a bit and massaged the commit msg ]
Signed-off-by: Jon Hunter <jon-hunter@ti.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1250617512-23567-2-git-send-email-jon-hunter@ti.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On some archs local_softirq_pending() has a data type of unsigned long
on others its unsigned int. Type cast it to (unsigned int) in the
printk to avoid the compiler warning.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
MIPS has two functions to calculcate the mult/shift factors for clock
sources and clock events at run time. ARM needs such functions as
well.
Implement a function which calculates the mult/shift factors based on
the frequencies to which and from which is converted. The function
also has a parameter to specify the minimum conversion range in
seconds. This range is guaranteed not to produce a 64bit overflow when
a value is multiplied with the calculated mult factor. The larger the
conversion range the less becomes the conversion accuracy.
Provide two inline wrappers which handle clock events and clock
sources. For clock events the "from" frequency is nano seconds per
second which corresponds to 1GHz and "to" is the device frequency. For
clock sources "from" is the device frequency and "to" is nano seconds
per second.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mikael Pettersson <mikpe@it.uu.se>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Linus Walleij <linus.walleij@stericsson.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20091111134229.766673305@linutronix.de>
The mult and shift factors of clock events differ in their data type
from those of clock sources for no reason. u32 is sufficient for
both. shift is always <= 32 and mult is limited to 2^32-1 to avoid
64bit multiplication overflows in the conversion.
Preparatory patch for a generic mult/shift factor calculation
function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mikael Pettersson <mikpe@it.uu.se>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Linus Walleij <linus.walleij@stericsson.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20091111134229.725664788@linutronix.de>
Lockdep events subsystem gathers various locking related events
such as a request, release, contention or acquisition of a lock.
The name of this event subsystem is a bit of a misnomer since
these events are not quite related to lockdep but more generally
to locking, ie: these events are not reporting lock dependencies
or possible deadlock scenario but pure locking events.
Hence this rename.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1258103194-843-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
An earlier fix for a race resulted in a situation where the CPUs
other than the CPU that detected the end of the grace period would
not process their callbacks until the next grace period started.
This means that these other CPUs would unnecessarily demand that an
extra grace period be started.
This patch eliminates this extra grace period and speeds callback
processing by propagating rsp->completed to the rcu_node structures
in the case where the CPU detecting the end of the grace period
sees no reason to start a new grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1258094104417-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of only considering SD_WAKE_AFFINE | SD_PREFER_SIBLING
domains also allow all SD_PREFER_SIBLING domains below a
SD_WAKE_AFFINE domain to change the affinity target.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091112145610.909723612@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up the new affine to idle sibling bits while trying to
grok them. Should not have any function differences.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091112145610.832503781@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Originally task_s/utime() were designed to return clock_t but
later changed to return cputime_t by following commit:
commit efe567fc82
Author: Christian Borntraeger <borntraeger@de.ibm.com>
Date: Thu Aug 23 15:18:02 2007 +0200
It only changed the type of return value, but not the
implementation. As the result the granularity of task_s/utime()
is still that of clock_t, not that of cputime_t.
So using task_s/utime() in __exit_signal() makes values
accumulated to the signal struct to be rounded and coarse
grained.
This patch removes casts to clock_t in task_u/stime(), to keep
granularity of cputime_t over the calculation.
v2:
Use div_u64() to avoid error "undefined reference to `__udivdi3`"
on some 32bit systems.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: xiyou.wangcong@gmail.com
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4AFB9029.9000208@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kthread_bind(), migrate_task() and sched_fork were missing
updates, and try_to_wake_up() was updating after having already
used the stale clock.
Aside from preventing potential latency hits, there' a side
benefit in that early boot printk time stamps become monotonic.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1258020464.6491.2.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <new-submission>
Now that all of the users stopped using ctl_name and strategy it
is safe to remove the fields from struct ctl_table, and it is safe
to remove the stub strategy routines as well.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name
and .strategy members of sysctl tables are dead code. Remove them.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
A malicious user could have passed in a ctl_name of 0 and triggered
the well know ctl_name to procname mapping code, instead of the wild
card matching code. This is a slight problem as wild card entries don't
have procnames, and because in some alternate universe a network device
might have ifindex 0. So test for and handle wild card entries first.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
dev_get_by_index does not exist when the network stack is not
compiled in, so only include the code to follow wild card paths
when the network stack is present.
I have shuffled the code around a little to make it clear
that dev_put is called after dev_get_by_index showing that
there is no leak.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Disabling interrupts in trace_clock_local takes quite a performance
hit to the recording of traces. Using perf top we see:
------------------------------------------------------------------------------
PerfTop: 244 irqs/sec kernel:100.0% [1000Hz cpu-clock-msecs], (all, 4 CPUs)
------------------------------------------------------------------------------
samples pcnt kernel function
_______ _____ _______________
2842.00 - 40.4% : trace_clock_local
1043.00 - 14.8% : rb_reserve_next_event
784.00 - 11.1% : ring_buffer_lock_reserve
600.00 - 8.5% : __rb_reserve_next
579.00 - 8.2% : rb_end_commit
440.00 - 6.3% : ring_buffer_unlock_commit
290.00 - 4.1% : ring_buffer_producer_thread [ring_buffer_benchmark]
155.00 - 2.2% : debug_smp_processor_id
117.00 - 1.7% : trace_recursive_unlock
103.00 - 1.5% : ring_buffer_event_data
28.00 - 0.4% : do_gettimeofday
22.00 - 0.3% : _spin_unlock_irq
14.00 - 0.2% : native_read_tsc
11.00 - 0.2% : getnstimeofday
Where trace_clock_local is 40% of the tracing, and the time for recording
a trace according to ring_buffer_benchmark is 210ns. After converting
the interrupts to preemption disabling we have from perf top:
------------------------------------------------------------------------------
PerfTop: 1084 irqs/sec kernel:99.9% [1000Hz cpu-clock-msecs], (all, 4 CPUs)
------------------------------------------------------------------------------
samples pcnt kernel function
_______ _____ _______________
1277.00 - 16.8% : native_read_tsc
1148.00 - 15.1% : rb_reserve_next_event
896.00 - 11.8% : ring_buffer_lock_reserve
688.00 - 9.1% : __rb_reserve_next
664.00 - 8.8% : rb_end_commit
563.00 - 7.4% : ring_buffer_unlock_commit
508.00 - 6.7% : _spin_unlock_irq
365.00 - 4.8% : debug_smp_processor_id
321.00 - 4.2% : trace_clock_local
303.00 - 4.0% : ring_buffer_producer_thread [ring_buffer_benchmark]
273.00 - 3.6% : native_sched_clock
122.00 - 1.6% : trace_recursive_unlock
113.00 - 1.5% : sched_clock
101.00 - 1.3% : ring_buffer_event_data
53.00 - 0.7% : tick_nohz_stop_sched_tick
Where trace_clock_local drops from 40% to only taking 4% of the total time.
The trace time also goes from 210ns down to 179ns (31ns).
I talked with Peter Zijlstra about the impact that sched_clock may have
without having interrupts disabled, and he told me that if a timer interrupt
comes in, sched_clock may report a wrong time.
Balancing a seldom incorrect timestamp with a 15% performance boost, I'll
take the performance boost.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that the glibc pthread implemenation no longers uses sysctl() users
of sysctl are as rare as hen's teeth. So remove the glibc exception
from the warning, and use the standard printk_ratelimit instead of
rolling our own.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The ring_buffer_benchmark does a gettimeofday after every write to the
ring buffer in its measurements. This adds the overhead of the call
to gettimeofday to the measurements and does not give an accurate picture
of the length of time it takes to record a trace.
This was first noticed with perf top:
------------------------------------------------------------------------------
PerfTop: 679 irqs/sec kernel:99.9% [1000Hz cpu-clock-msecs], (all, 4 CPUs)
------------------------------------------------------------------------------
samples pcnt kernel function
_______ _____ _______________
1673.00 - 27.8% : trace_clock_local
806.00 - 13.4% : do_gettimeofday
590.00 - 9.8% : rb_reserve_next_event
554.00 - 9.2% : native_read_tsc
431.00 - 7.2% : ring_buffer_lock_reserve
365.00 - 6.1% : __rb_reserve_next
355.00 - 5.9% : rb_end_commit
322.00 - 5.4% : getnstimeofday
268.00 - 4.5% : ring_buffer_unlock_commit
262.00 - 4.4% : ring_buffer_producer_thread [ring_buffer_benchmark]
113.00 - 1.9% : read_tsc
91.00 - 1.5% : debug_smp_processor_id
69.00 - 1.1% : trace_recursive_unlock
66.00 - 1.1% : ring_buffer_event_data
25.00 - 0.4% : _spin_unlock_irq
And the length of each write to the ring buffer measured at 310ns.
This patch adds a new module parameter called "write_interval" which is
defaulted to 50. This is the number of writes performed between
timestamps. After this patch perf top shows:
------------------------------------------------------------------------------
PerfTop: 244 irqs/sec kernel:100.0% [1000Hz cpu-clock-msecs], (all, 4 CPUs)
------------------------------------------------------------------------------
samples pcnt kernel function
_______ _____ _______________
2842.00 - 40.4% : trace_clock_local
1043.00 - 14.8% : rb_reserve_next_event
784.00 - 11.1% : ring_buffer_lock_reserve
600.00 - 8.5% : __rb_reserve_next
579.00 - 8.2% : rb_end_commit
440.00 - 6.3% : ring_buffer_unlock_commit
290.00 - 4.1% : ring_buffer_producer_thread [ring_buffer_benchmark]
155.00 - 2.2% : debug_smp_processor_id
117.00 - 1.7% : trace_recursive_unlock
103.00 - 1.5% : ring_buffer_event_data
28.00 - 0.4% : do_gettimeofday
22.00 - 0.3% : _spin_unlock_irq
14.00 - 0.2% : native_read_tsc
11.00 - 0.2% : getnstimeofday
do_gettimeofday dropped from 13% usage to a mere 0.4%! (using the default
50 interval) The measurement for each timestamp went from 310ns to 210ns.
That's 100ns (1/3rd) overhead that the gettimeofday call was introducing.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing_stats_read() mistakenly returns ENOMEM instead
of the negative value -ENOMEM.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
LKML-Reference: <4AFB2C0B.50605@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
highmem: Fix debug_kmap_atomic() to also handle KM_IRQ_PTE, KM_NMI, and KM_NMI_PTE
highmem: Fix race in debug_kmap_atomic() which could cause warn_count to underflow
rcu: Fix long-grace-period race between forcing and initialization
uids: Prevent tear down race
The ctl_name and strategy fields are unused, now that sys_sysctl
is a compatibility wrapper around /proc/sys. No longer looking
at them in the generic code is effectively what we are doing
now and provides the guarantee that during further cleanups
we can just remove references to those fields and everything
will work ok.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name
and .strategy members of sysctl tables are dead code. Remove them.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Now that the sys_sysctl is now a compatibility wrapper around
/proc/sys we can remove much of sysctl_check and reduce it
to a few remaining sanity checks. This completely decouples
it from the binary sysctl system call.
Little things like ensuring that the sysctl has not already
been registered are all that remain.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Now that sys_sysctl is a compatibility layer on top of /proc/sys
these routines are never called but are still put in sysctl
tables so I have reduced them to stubs until they can be
removed entirely.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
To simply maintenance and to be able to remove all of the binary
sysctl support from various subsystems I have rewritten the binary
sysctl code as a compatibility wrapper around proc/sys.
The code is built around a hard coded table based on the table
in sysctl_check.c that lists all of our current binary sysctls
and provides enough information to convert from the sysctl
binary input into into ascii and back again. New in this
patch is the realization that the only dynamic entries
that need to be handled have ifname as the asscii string
and ifindex as their ctl_name.
When a sys_sysctl is called the code now looks in the
translation table converting the binary name to the
path under /proc where the value is to be found. Opens
that file, and calls into a format conversion wrapper
that calls fop->read and then fop->write as appropriate.
Since in practice the practically no one uses or tests
sys_sysctl rewritting the code to be beautiful is a little
silly. The redeeming merit of this work is it allows us to
rip out all of the binary sysctl syscall support from
everywhere else in the tree. Allowing us to remove
a lot of dead (after this patch) and barely maintained code.
In addition it becomes much easier to optimize the sysctl
implementation for being the backing store of /proc/sys,
without having to worry about sys_sysctl.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The rdp->passed_quiesc_completed fields are used to properly
associate the recorded quiescent state with a grace period. It
is OK to wrongly associate a given quiescent state with a
preceding grace period, but it is fatal to associate a given
quiescent state with a grace period that begins after the
quiescent state occurred. Grace periods are numbered, and the
following fields track them:
o ->gpnum is the number of the grace period currently in
progress, or the number of the last grace period to
complete if no grace period is currently in progress.
o ->completed is the number of the last grace period to
have completed.
These two fields are equal if there is no grace period in
progress, otherwise ->gpnum is one greater than ->completed.
But the rdp->passed_quiesc_completed field compared against
->completed, and if equal, the quiescent state is presumed to
count against the current grace period.
The earlier code copied rdp->completed to
rdp->passed_quiesc_completed, which has been made to work, but
is error-prone. In contrast, copying one less than rdp->gpnum
is guaranteed safe, because rdp->gpnum is not incremented until
after the start of the corresponding grace period. At the end of
the grace period, when ->completed has incremented, then any
quiescent periods recorded previously will be discarded.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12578890421011-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From the code in rt_mutex_setprio(), it is evident that the
intention is that task's with a RT 'prio' value as a consequence
of receiving a PI boost also have their 'sched_class' field set
to '&rt_sched_class'.
However, Peter noticed that the code in __setscheduler() could
result in this intention being frustrated. Fix it.
Reported-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <1257880321.4108.457.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The hw-breakpoint sample module has been broken during the
hw-breakpoint internals refactoring. Propagate the changes
to it.
Reported-by: "K. Prasad" <prasad@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
All of the infrastructure already exists to support read accesses
for platforms that support a read access independently of read/write
(such as in the case of the SuperH UBC). This just trivially hooks
up the read case by itself.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
LKML-Reference: <20091109083733.GA25848@linux-sh.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Commit 1b9508f, "Rate-limit newidle" has been confirmed to fix
the netperf UDP loopback regression reported by Alex Shi.
This is a cleanup and a fix:
- moved to a more out of the way spot
- fix to ensure that balancing doesn't try to balance
runqueues which haven't gone online yet, which can
mess up CPU enumeration during boot.
Reported-by: Alex Shi <alex.shi@intel.com>
Reported-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org> # .32.x: a1f84a3: sched: Check for an idle shared cache
Cc: <stable@kernel.org> # .32.x: 1b9508f: sched: Rate-limit newidle
Cc: <stable@kernel.org> # .32.x: fd21073: sched: Fix affinity logic
Cc: <stable@kernel.org> # .32.x
LKML-Reference: <1257821402.5648.17.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impose a clear locking design on the note_new_gpnum()
function's use of the ->gpnum counter. This is done by updating
rdp->gpnum only from the corresponding leaf rcu_node structure's
rnp->gpnum field, and even then only under the protection of
that same rcu_node structure's ->lock field. Performance and
scalability are maintained using a form of double-checked
locking, and excessive spinning is avoided by use of the
spin_trylock() function. The use of spin_trylock() is safe due
to the fact that CPUs who fail to acquire this lock will try
again later. The hierarchical nature of the rcu_node data
structure limits contention (which could be limited further if
need be using the RCU_FANOUT kernel parameter).
Without this patch, obscure but quite possible races could
result in a quiescent state that occurred during one grace
period to be accounted to the following grace period, causing
this following grace period to end prematurely. Not good!
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: <stable@kernel.org> # .32.x
LKML-Reference: <12571987492350-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impose a clear locking design on the rcu_process_gp_end()
function's use of the ->completed counter. This is done by
creating a ->completed field in the rcu_node structure, which
can safely be accessed under the protection of that structure's
lock. Performance and scalability are maintained by using a
form of double-checked locking, so that rcu_process_gp_end()
only acquires the leaf rcu_node structure's ->lock if a grace
period has recently ended.
This fix reduces rcutorture failure rate by at least two orders
of magnitude under heavy stress with force_quiescent_state()
being invoked artificially often. Without this fix,
unsynchronized access to the ->completed field can cause
rcu_process_gp_end() to advance callbacks whose grace period has
not yet expired. (Bad idea!)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: <stable@kernel.org> # .32.x
LKML-Reference: <12571987494069-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For SELinux to do better filtering in userspace we send the name of the
module along with the AVC denial when a program is denied module_request.
Example output:
type=SYSCALL msg=audit(11/03/2009 10:59:43.510:9) : arch=x86_64 syscall=write success=yes exit=2 a0=3 a1=7fc28c0d56c0 a2=2 a3=7fffca0d7440 items=0 ppid=1727 pid=1729 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=rpc.nfsd exe=/usr/sbin/rpc.nfsd subj=system_u:system_r:nfsd_t:s0 key=(null)
type=AVC msg=audit(11/03/2009 10:59:43.510:9) : avc: denied { module_request } for pid=1729 comm=rpc.nfsd kmod="net-pf-10" scontext=system_u:system_r:nfsd_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=system
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
When the system has too many timers or too many aggregate
queued signals, the EAGAIN error is returned to application
from kernel, including timer_create() [POSIX.1b].
It means that the app exceeded the limit of pending signals,
but in general application writers do not expect this
outcome and the current silent failure can cause rare app
failures under very high load.
This patch adds a new message when we reach the limit
and if print_fatal_signals is enabled:
task/1234: reached RLIMIT_SIGPENDING, dropping signal
If you see this message and your system behaved unexpectedly,
you can run following command to lift the limit:
# ulimit -i unlimited
With help from Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>.
Signed-off-by: Naohiro Ooiwa <nooiwa@miraclelinux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: oleg@redhat.com
LKML-Reference: <4AF6E7E2.9080406@miraclelinux.com>
[ Modified a few small details, gave surrounding code some love. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch was generated by
git grep -E -i -l 's(le|el)ct' | xargs -r perl -p -i -e 's/([Ss])(le|el)ct/$1elect/
with only skipping net/netfilter/xt_SECMARK.c and
include/linux/netfilter/xt_SECMARK.h which have a struct member called
selctx.
Signed-off-by: Uwe Kleine-Knig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The macro used to be used in both trace_selftest.c and
trace_ksym.c, but no longer, so remove it from header file.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Allow or refuse to build a counter using the breakpoints pmu following
given constraints.
We keep track of the pmu users by using three per cpu variables:
- nr_cpu_bp_pinned stores the number of pinned cpu breakpoints counters
in the given cpu
- nr_bp_flexible stores the number of non-pinned breakpoints counters
in the given cpu.
- task_bp_pinned stores the number of pinned task breakpoints in a cpu
The latter is not a simple counter but gathers the number of tasks that
have n pinned breakpoints.
Considering HBP_NUM the number of available breakpoint address
registers:
task_bp_pinned[0] is the number of tasks having 1 breakpoint
task_bp_pinned[1] is the number of tasks having 2 breakpoints
[...]
task_bp_pinned[HBP_NUM - 1] is the number of tasks having the
maximum number of registers (HBP_NUM).
When a breakpoint counter is created and wants an access to the pmu,
we evaluate the following constraints:
== Non-pinned counter ==
- If attached to a single cpu, check:
(per_cpu(nr_bp_flexible, cpu) || (per_cpu(nr_cpu_bp_pinned, cpu)
+ max(per_cpu(task_bp_pinned, cpu)))) < HBP_NUM
-> If there are already non-pinned counters in this cpu, it
means there is already a free slot for them.
Otherwise, we check that the maximum number of per task
breakpoints (for this cpu) plus the number of per cpu
breakpoint (for this cpu) doesn't cover every registers.
- If attached to every cpus, check:
(per_cpu(nr_bp_flexible, *) || (max(per_cpu(nr_cpu_bp_pinned, *))
+ max(per_cpu(task_bp_pinned, *)))) < HBP_NUM
-> This is roughly the same, except we check the number of per
cpu bp for every cpu and we keep the max one. Same for the
per tasks breakpoints.
== Pinned counter ==
- If attached to a single cpu, check:
((per_cpu(nr_bp_flexible, cpu) > 1)
+ per_cpu(nr_cpu_bp_pinned, cpu)
+ max(per_cpu(task_bp_pinned, cpu))) < HBP_NUM
-> Same checks as before. But now the nr_bp_flexible, if any,
must keep one register at least (or flexible breakpoints will
never be be fed).
- If attached to every cpus, check:
((per_cpu(nr_bp_flexible, *) > 1)
+ max(per_cpu(nr_cpu_bp_pinned, *))
+ max(per_cpu(task_bp_pinned, *))) < HBP_NUM
Changes in v2:
- Counter -> event rename
Changes in v5:
- Fix unreleased non-pinned task-bound-only counters. We only released
it in the first cpu. (Thanks to Paul Mackerras for reporting that)
Changes in v6:
- Currently, events scheduling are done in this order: cpu context
pinned + cpu context non-pinned + task context pinned + task context
non-pinned events. Then our current constraints are right theoretically
but not in practice, because non-pinned counters may be scheduled
before we can apply every possible pinned counters. So consider
non-pinned counters as pinned for now.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
This patch rebase the implementation of the breakpoints API on top of
perf events instances.
Each breakpoints are now perf events that handle the
register scheduling, thread/cpu attachment, etc..
The new layering is now made as follows:
ptrace kgdb ftrace perf syscall
\ | / /
\ | / /
/
Core breakpoint API /
/
| /
| /
Breakpoints perf events
|
|
Breakpoints PMU ---- Debug Register constraints handling
(Part of core breakpoint API)
|
|
Hardware debug registers
Reasons of this rewrite:
- Use the centralized/optimized pmu registers scheduling,
implying an easier arch integration
- More powerful register handling: perf attributes (pinned/flexible
events, exclusive/non-exclusive, tunable period, etc...)
Impact:
- New perf ABI: the hardware breakpoints counters
- Ptrace breakpoints setting remains tricky and still needs some per
thread breakpoints references.
Todo (in the order):
- Support breakpoints perf counter events for perf tools (ie: implement
perf_bpcounter_event())
- Support from perf tools
Changes in v2:
- Follow the perf "event " rename
- The ptrace regression have been fixed (ptrace breakpoint perf events
weren't released when a task ended)
- Drop the struct hw_breakpoint and store generic fields in
perf_event_attr.
- Separate core and arch specific headers, drop
asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
- Use new generic len/type for breakpoint
- Handle off case: when breakpoints api is not supported by an arch
Changes in v3:
- Fix broken CONFIG_KVM, we need to propagate the breakpoint api
changes to kvm when we exit the guest and restore the bp registers
to the host.
Changes in v4:
- Drop the hw_breakpoint_restore() stub as it is only used by KVM
- EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a
module
- Restore the breakpoints unconditionally on kvm guest exit:
TIF_DEBUG_THREAD doesn't anymore cover every cases of running
breakpoints and vcpu->arch.switch_db_regs might not always be
set when the guest used debug registers.
(Waiting for a reliable optimization)
Changes in v5:
- Split-up the asm-generic/hw-breakpoint.h moving to
linux/hw_breakpoint.h into a separate patch
- Optimize the breakpoints restoring while switching from kvm guest
to host. We only want to restore the state if we have active
breakpoints to the host, otherwise we don't care about messed-up
address registers.
- Add asm/hw_breakpoint.h to Kbuild
- Fix bad breakpoint type in trace_selftest.c
Changes in v6:
- Fix wrong header inclusion in trace.h (triggered a build
error with CONFIG_FTRACE_SELFTEST
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
In 15934a3732,
field last_tick_seen is added to struct rq.
But it is unused now.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Guillaume Chazarain <guichaz@yahoo.fr>
LKML-Reference: <4AE6A513.6010100@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
root_task_group_empty is used only with FAIR_GROUP_SCHED
so if we use other scheduler options we get:
kernel/sched.c:314: warning: 'root_task_group_empty' defined but not used
So move CONFIG_FAIR_GROUP_SCHED up that it covers
root_task_group_empty().
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091026192414.GB5321@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If a parent directory (ie /proc/irq/<irq>) could not be created
we should not attempt to create subdirectories. Otherwise it
would lead that "smp_affinity" and "spurious" entries are may be
registered under /proc root instead of a proper place.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20091026202811.GD5321@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix variable name in sched.c kernel-doc notation.
Fixes this DocBook warning:
Warning(kernel/sched.c:2008): No description found for parameter
'p' Warning(kernel/sched.c:2008): Excess function parameter 'k'
description in 'kthread_bind'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <4AF4B1BC.8020604@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While tracing using events with perf, if one enables the
lockdep:lock_acquire event, it will infect every other perf
trace events.
Basically, you can enable whatever set of trace events through
perf but if this event is part of the set, the only result we
can get is a long list of lock_acquire events of rcu read lock,
and only that.
This is because of a recursion inside perf.
1) When a trace event is triggered, it will fill a per cpu
buffer and submit it to perf.
2) Perf will commit this event but will also protect some data
using rcu_read_lock
3) A recursion appears: rcu_read_lock triggers a lock_acquire
event that will fill the per cpu event and then submit the
buffer to perf.
4) Perf detects a recursion and ignores it
5) Perf continues its work on the previous event, but its buffer
has been overwritten by the lock_acquire event, it has then
been turned into a lock_acquire event of rcu read lock
Such scenario also happens with lock_release with
rcu_read_unlock().
We could turn the rcu_read_lock() into __rcu_read_lock() to drop
the lock debugging from perf fast path, but that would make us
lose the rcu debugging and that doesn't prevent from other
possible kind of recursion from perf in the future.
This patch adds a recursion protection based on a counter on the
perf trace per cpu buffers to solve the problem.
-v2: Fixed lost whitespace, added reviewed-by tag
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <1257477185-7838-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that all architechtures are use compat_sys_sysctl and sys32_sysctl
does not exist there is not point in retaining a cond_syscall
entry for it.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This uses compat_alloc_userspace to remove the various
hacks to allow do_sysctl to write to throuh oldlenp.
The rest of our mature compat syscall helper facitilies
are used as well to ensure we have a nice clean maintainable
compat syscall that can be used on all architectures.
The motiviation for a generic compat sysctl (besides the
obvious hack removal) is to reduce the number of compat
sysctl defintions out there so I can refactor the
binary sysctl implementation.
ppc already used the name compat_sys_sysctl so I remove the
ppcs version here.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Read in the binary sysctl path once, instead of reread it
from user space each time the code needs to access a path
element.
The deprecated sysctl warning is moved to do_sysctl so
that the compat_sysctl entries syscalls will also warn.
The return of -ENOSYS when !CONFIG_SYSCTL_SYSCALL is moved
to binary_sysctl. Always leaving a do_sysctl available
that handles !CONFIG_SYSCTL_SYSCALL and printing the
deprecated sysctl warning allows for a single defitition
of the sysctl syscall.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
In preparation for more invasive cleanups separate the core
binary sysctl logic into it's own file.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c
sched: Disable SD_PREFER_LOCAL at node level
sched: Fix boot crash by zalloc()ing most of the cpu masks
sched: Strengthen buddies and mitigate buddy induced latencies
Allow the architecture to request a normal jiffy tick when the system
goes idle and tick_nohz_stop_sched_tick is called . On s390 the hook is
used to prevent the system going fully idle if there has been an
interrupt other than a clock comparator interrupt since the last wakeup.
On s390 the HiperSockets response time for 1 connection ping-pong goes
down from 42 to 34 microseconds. The CPU cost decreases by 27%.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <20090929122533.402715150@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On a system with NOHZ=y tick_check_idle calls tick_nohz_stop_idle and
tick_nohz_update_jiffies. Given the right conditions (ts->idle_active
and/or ts->tick_stopped) both function get a time stamp with ktime_get.
The same time stamp can be reused if both function require one.
On s390 this change has the additional benefit that gcc inlines the
tick_nohz_stop_idle function into tick_check_idle. The number of
instructions to execute tick_check_idle drops from 225 to 144
(without the ktime_get optimization it is 367 vs 215 instructions).
before:
0) | tick_check_idle() {
0) | tick_nohz_stop_idle() {
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.601 us | }
0) 1.765 us | }
0) 3.047 us | }
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.570 us | }
0) 1.727 us | }
0) | tick_do_update_jiffies64() {
0) 0.609 us | }
0) 8.055 us | }
after:
0) | tick_check_idle() {
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.617 us | }
0) 1.773 us | }
0) | tick_do_update_jiffies64() {
0) 0.593 us | }
0) 4.477 us | }
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090929122533.206589318@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When "allocate_resource(root, new, size, ...)" fails, we currently
clobber "new". This is inconvenient for the caller, who might care
about the original contents of the resource.
For example, when pci_bus_alloc_resource() fails, the "can't allocate
mem resource %pR" message from pci_assign_resources() currently contains
junk for the resource start/end.
This patch delays the "new" update until we're about to return success.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Rate limit newidle to migration_cost. It's a win for all
stages of sysbench oltp tests.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When waking affine, check for an idle shared cache, and if
found, wake to that CPU/sibling instead of the waker's CPU.
This improves pgsql+oltp ramp up by roughly 8%. Possibly more
for other loads, depending on overlap. The trade-off is a
roughly 1% peak downturn if tasks are truly synchronous.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
LKML-Reference: <1256654138.17752.7.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
commit 74296a8ed added this function for debug purposes, but it was
never used for anything. Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix docbook comments to match the actual function names
(set_irq_msi, handle_percpu_irq).
Signed-off-by: Liuwenyi <qingshenlwy@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently partition_sched_domains() takes a 'struct cpumask
*doms_new' which is a kmalloc'ed array of cpumask_t. You can't
have such an array if 'struct cpumask' is undefined, as we plan
for CONFIG_CPUMASK_OFFSTACK=y.
So, we make this an array of cpumask_var_t instead: this is the
same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
Hence we add alloc_sched_domains() and free_sched_domains()
functions.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <200911031453.40668.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
find_lowest_rq() wants to call pick_optimal_cpu() on the
intersection of sched_domain_span(sd) and lowest_mask. Rather
than doing a cpus_and into a temporary, we can open-code it.
This actually makes the code slightly clearer, IMHO.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <200911031453.15350.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rename Kprobes-based event tracer to kprobes-based tracing event
(kprobe-event), since it is not a tracer but an extensible
tracing event interface.
This also changes CONFIG_KPROBE_TRACER to CONFIG_KPROBE_EVENT
and sets it y by default.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
LKML-Reference: <20091104001247.3454.14131.stgit@harusame>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
tools/perf/Makefile
Merge reason: Resolve the conflict, merge to upstream and merge in
perf fixes so we can add a dependent patch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cpu_nr_migrations() is not used, remove it.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4AF12A66.6020609@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a command is passed to the set_ftrace_filter, then
the ftrace_regex_lock is still held going back to user space.
# echo 'do_open : foo' > set_ftrace_filter
(still holding ftrace_regex_lock when returning to user space!)
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4AEF7F8A.3080300@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We got a sudden panic when we reduced the size of the
ringbuffer.
We can reproduce the panic by the following steps:
echo 1 > events/sched/enable
cat trace_pipe > /dev/null &
while ((1))
do
echo 12000 > buffer_size_kb
echo 512 > buffer_size_kb
done
(not more than 5 seconds, panic ...)
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4AF01735.9060409@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A simple callback in a perf event can be used for multiple purposes.
For example it is useful for triggered based events like hardware
breakpoints that need a callback to dispatch a triggered breakpoint
event.
v2: Simplify a bit the callback attribution as suggested by Paul
Mackerras
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "K.Prasad" <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mundt <lethal@linux-sh.org>
There are reasons for kernel code to ask for, and use, performance
counters.
For example, in CPU freq governors this tends to be a good idea, but
there are other examples possible as well of course.
This patch adds the needed bits to do enable this functionality; they
have been tested in an experimental cpufreq driver that I'm working on,
and the changes are all that I needed to access counters properly.
[fweisbec@gmail.com: added pid to perf_event_create_kernel_counter so
that we can profile a particular task too
TODO: Have a better error reporting, don't just return NULL in fail
case.]
v2: Remove the wrong comment about the fact
perf_event_create_kernel_counter must be called from a kernel
thread.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: "K.Prasad" <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Avi Kivity <avi@redhat.com>
LKML-Reference: <20090925122556.2f8bd939@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM: Remove some debug messages producing too much noise
PM: Fix warning on suspend errors
PM / Hibernate: Add newline to load_image() fail path
PM / Hibernate: Fix error handling in save_image()
PM / Hibernate: Fix blkdev refleaks
PM / yenta: Split resume into early and late parts (rev. 4)
nr_processes() returns the sum of the per cpu counter process_counts for
all online CPUs. This counter is incremented for the current CPU on
fork() and decremented for the current CPU on exit(). Since a process
does not necessarily fork and exit on the same CPU the process_count for
an individual CPU can be either positive or negative and effectively has
no meaning in isolation.
Therefore calculating the sum of process_counts over only the online
CPUs omits the processes which were started or stopped on any CPU which
has since been unplugged. Only the sum of process_counts across all
possible CPUs has meaning.
The only caller of nr_processes() is proc_root_getattr() which
calculates the number of links to /proc as
stat->nlink = proc_root.nlink + nr_processes();
You don't have to be all that unlucky for the nr_processes() to return a
negative value leading to a negative number of links (or rather, an
apparently enormous number of links). If this happens then you can get
failures where things like "ls /proc" start to fail because they got an
-EOVERFLOW from some stat() call.
Example with some debugging inserted to show what goes on:
# ps haux|wc -l
nr_processes: CPU0: 90
nr_processes: CPU1: 1030
nr_processes: CPU2: -900
nr_processes: CPU3: -136
nr_processes: TOTAL: 84
proc_root_getattr. nlink 12 + nr_processes() 84 = 96
84
# echo 0 >/sys/devices/system/cpu/cpu1/online
# ps haux|wc -l
nr_processes: CPU0: 85
nr_processes: CPU2: -901
nr_processes: CPU3: -137
nr_processes: TOTAL: -953
proc_root_getattr. nlink 12 + nr_processes() -953 = -941
75
# stat /proc/
nr_processes: CPU0: 84
nr_processes: CPU2: -901
nr_processes: CPU3: -137
nr_processes: TOTAL: -954
proc_root_getattr. nlink 12 + nr_processes() -954 = -942
File: `/proc/'
Size: 0 Blocks: 0 IO Block: 1024 directory
Device: 3h/3d Inode: 1 Links: 4294966354
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2009-11-03 09:06:55.000000000 +0000
Modify: 2009-11-03 09:06:55.000000000 +0000
Change: 2009-11-03 09:06:55.000000000 +0000
I'm not 100% convinced that the per_cpu regions remain valid for offline
CPUs, although my testing suggests that they do. If not then I think the
correct solution would be to aggregate the process_count for a given CPU
into a global base value in cpu_down().
This bug appears to pre-date the transition to git and it looks like it
may even have been present in linux-2.6.0-test7-bk3 since it looks like
the code Rusty patched in http://lwn.net/Articles/64773/ was already
wrong.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Finish a line by \n when load_image fails in the middle of loading.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
There are too many retval variables in save_image(). Thus error return
value from snapshot_read_next() may be ignored and only part of the
snapshot (successfully) written.
Remove 'error' variable, invert the condition in the do-while loop
and convert the loop to use only 'ret' variable.
Switch the rest of the function to consider only 'ret'.
Also make sure we end printed line by \n if an error occurs.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
While cruising through the swsusp code I found few blkdev reference
leaks of resume_bdev.
swsusp_read: remove blkdev_put altogether. Some fail paths do
not do that.
swsusp_check: make sure we always put a reference on fail paths
software_resume: all fail paths between swsusp_check and swsusp_read
omit swsusp_close. Add it in those cases. And since
swsusp_read doesn't drop the reference anymore, do
it here unconditionally.
[rjw: Fixed a small coding style issue.]
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Eric Paris reported that commit
f685ceacab causes boot time
PREEMPT_DEBUG complaints.
[ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
[ 4.593043] caller is task_hot+0x86/0xd0
Since kthread_bind() messes with scheduler internals, move the
body to sched.c, and lock the runqueue.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Tested-by: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1256813310.7574.3.camel@marge.simson.net>
[ v2: fix !SMP build and clean up ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf tools: Remove -Wcast-align
perf tools: Fix compatibility with libelf 0.8 and autodetect
perf events: Don't generate events for the idle task when exclude_idle is set
perf events: Fix swevent hrtimer sampling by keeping track of remaining time when enabling/disabling swevent hrtimers
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Remove cpu arg from the rb_time_stamp() function
tracing: Fix comment typo and documentation example
tracing: Fix trace_seq_printf() return value
tracing: Update *ppos instead of filp->f_pos
For as long as kretprobes have existed, we've allocated NR_CPUS
instances of kretprobe_instance structures. With the default
value of CONFIG_NR_CPUS increasing on certain architectures, we
are potentially wasting kernel memory.
See http://sourceware.org/bugzilla/show_bug.cgi?id=10839#c3 for
more details.
Use a saner num_possible_cpus() instead of NR_CPUS for
allocation.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: fweisbec@gmail.com
LKML-Reference: <20091030135310.GA22230@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, rcu_irq_exit() is invoked only for CONFIG_NO_HZ,
while rcu_irq_enter() is invoked unconditionally. This patch
moves rcu_irq_exit() out from under CONFIG_NO_HZ so that the
calls are balanced.
This patch has no effect on the behavior of the kernel because
both rcu_irq_enter() and rcu_irq_exit() are empty for
!CONFIG_NO_HZ, but the code is easier to understand if the calls
are obviously balanced in all cases.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12567428891605-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Very long RCU read-side critical sections (50 milliseconds or
so) can cause a race between force_quiescent_state() and
rcu_start_gp() as follows on kernel builds with multi-level
rcu_node hierarchies:
1. CPU 0 calls force_quiescent_state(), sees that there is a
grace period in progress, and acquires ->fsqlock.
2. CPU 1 detects the end of the grace period, and so
cpu_quiet_msk_finish() sets rsp->completed to rsp->gpnum.
This operation is carried out under the root rnp->lock,
but CPU 0 has not yet acquired that lock. Note that
rsp->signaled is still RCU_SAVE_DYNTICK from the last
grace period.
3. CPU 1 calls rcu_start_gp(), but no one wants a new grace
period, so it drops the root rnp->lock and returns.
4. CPU 0 acquires the root rnp->lock and picks up rsp->completed
and rsp->signaled, then drops rnp->lock. It then enters the
RCU_SAVE_DYNTICK leg of the switch statement.
5. CPU 2 invokes call_rcu(), and now needs a new grace period.
It calls rcu_start_gp(), which acquires the root rnp->lock, sets
rsp->signaled to RCU_GP_INIT (too bad that CPU 0 is already in
the RCU_SAVE_DYNTICK leg of the switch statement!) and starts
initializing the rcu_node hierarchy. If there are multiple
levels to the hierarchy, it will drop the root rnp->lock and
initialize the lower levels of the hierarchy.
6. CPU 0 notes that rsp->completed has not changed, which permits
both CPU 2 and CPU 0 to try updating it concurrently. If CPU 0's
update prevails, later calls to force_quiescent_state() can
count old quiescent states against the new grace period, which
can in turn result in premature ending of grace periods.
Not good.
This patch adds an RCU_GP_IDLE state for rsp->signaled that is
set initially at boot time and any time a grace period ends.
This prevents CPU 0 from getting into the workings of
force_quiescent_state() in step 4. Additional locking and
checks prevent the concurrent update of rsp->signaled in step 6.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1256742889199-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo triggered the following warning:
WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
Hardware name: System Product Name
ODEBUG: init active object type: timer_list
Modules linked in:
Pid: 2619, comm: dmesg Tainted: G W 2.6.32-rc5-tip+ #5298
Call Trace:
[<81035443>] warn_slowpath_common+0x6a/0x81
[<8120e483>] ? debug_print_object+0x42/0x50
[<81035498>] warn_slowpath_fmt+0x29/0x2c
[<8120e483>] debug_print_object+0x42/0x50
[<8120ec2a>] __debug_object_init+0x279/0x2d7
[<8120ecb3>] debug_object_init+0x13/0x18
[<810409d2>] init_timer_key+0x17/0x6f
[<81041526>] free_uid+0x50/0x6c
[<8104ed2d>] put_cred_rcu+0x61/0x72
[<81067fac>] rcu_do_batch+0x70/0x121
debugobjects warns about an enqueued timer being initialized. If
CONFIG_USER_SCHED=y the user management code uses delayed work to
remove the user from the hash table and tear down the sysfs objects.
free_uid is called from RCU and initializes/schedules delayed work if
the usage count of the user_struct is 0. The init/schedule happens
outside of the uidhash_lock protected region which allows a concurrent
caller of find_user() to reference the about to be destroyed
user_struct w/o preventing the work from being scheduled. If the next
free_uid call happens before the work timer expired then the active
timer is initialized and the work scheduled again.
The race was introduced in commit 5cb350ba (sched: group scheduling,
sysfs tunables) and made more prominent by commit 3959214f (sched:
delayed cleanup of user_struct)
Move the init/schedule_delayed_work inside of the uidhash_lock
protected region to prevent the race.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: stable@kernel.org
I got a boot crash when forcing cpumasks offstack on 32 bit,
because find_new_ilb() returned 3 on my UP system (nohz.cpu_mask
wasn't zeroed).
AFAICT the others need to be zeroed too: only
nohz.ilb_grp_nohz_mask is initialized before use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <200911022037.21282.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
____ftrace_check_##name() is used for compile-time check on
F_printk() only, so it should be marked as __unused instead
of __used.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4AEE2D01.4010305@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Today's linux-next build (x86_64 allmodconfig) failed like this:
kernel/user-return-notifier.c: In function
'fire_user_return_notifiers': kernel/user-return-notifier.c:45:
error: expected expression before ')' token
Introduced by commit 7c68af6e32
("core, x86: Add user return notifiers") from the tip and kvm trees
but revealed by commit e0fdb0e050
("percpu: add __percpu for sparse") from the percpu tree.
Before that percpu tree commit, "put_cpu_var()" would compile
without error (even though it really needs a parameter).
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
LKML-Reference: <20091102161722.eea4358d.sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
sched: move rq_weight data array out of .percpu
percpu: allow pcpu_alloc() to be called with IRQs off
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-param-fixes:
param: fix setting arrays of bool
param: fix NULL comparison on oom
param: fix lots of bugs with writing charp params from sysfs, by leaking mem.
* 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
HWPOISON: fix invalid page count in printk output
HWPOISON: Allow schedule_on_each_cpu() from keventd
HWPOISON: fix/proc/meminfo alignment
HWPOISON: fix oops on ksm pages
HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page
HWPOISON: return early on non-LRU pages
HWPOISON: Add brief hwpoison description to Documentation
HWPOISON: Clean up PR_MCE_KILL interface
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futex: Move drop_futex_key_refs out of spinlock'ed region
rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hang
rcu: Stopgap fix for synchronize_rcu_expedited() for TREE_PREEMPT_RCU
rcu: Prevent RCU IPI storms in presence of high call_rcu() load
futex: Check for NULL keys in match_futex
futex: Handle spurious wake up
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf timechart: Improve the visual appearance of scheduler delays
perf timechart: Fix the wakeup-arrows that point to non-visible processes
perf top: Fix --delay_secs 0 division by zero
perf tools: Bump version to 0.0.2
perf_event: Adjust frequency and unthrottle for non-group-leader events
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Do less agressive buddy clearing
sched: Disable SD_PREFER_LOCAL for MC/CPU domains
Having ->procname but not ->proc_handler is valid when PROC_SYSCTL=n,
people use such combination to reduce ifdefs with non-standard handlers.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14408
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: Peter Teoh <htmldeveloper@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 02b51df1b0 (proc connector: add
event for process becoming session leader) we have the following warning:
Badness at kernel/softirq.c:143
[...]
Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
[...]
Call Trace:
([<000000013fe04100>] 0x13fe04100)
[<000000000048a946>] sk_filter+0x9a/0xd0
[<000000000049d938>] netlink_broadcast+0x2c0/0x53c
[<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0
[<00000000003baef0>] proc_sid_connector+0xc4/0xd4
[<0000000000142604>] __set_special_pids+0x58/0x90
[<0000000000159938>] sys_setsid+0xb4/0xd8
[<00000000001187fe>] sysc_noemu+0x10/0x16
[<00000041616cb266>] 0x41616cb266
The warning is
---> WARN_ON_ONCE(in_irq() || irqs_disabled());
The network code must not be called with disabled interrupts but
sys_setsid holds the tasklist_lock with spinlock_irq while calling the
connector.
After a discussion we agreed that we can move proc_sid_connector from
__set_special_pids to sys_setsid.
We also agreed that it is sufficient to change the check from
task_session(curr) != pid into err > 0, since if we don't change the
session, this means we were already the leader and return -EPERM.
One last thing:
There is also daemonize(), and some people might want to get a
notification in that case. Since daemonize() is only needed if a user
space does kernel_thread this does not look important (and there seems
to be no consensus if this connector should be called in daemonize). If
we really want this, we can add proc_sid_connector to daemonize() in an
additional patch (Scott?)
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Scott James Remnant <scott@ubuntu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the return from alloc_percpu is compatible with the address
of per-cpu vars, it makes sense to hand around the address of per-cpu
variables. To make this sane, we remove the per_cpu__ prefix we used
created to stop people accidentally using these vars directly.
Now we have sparse, we can use that (next patch).
tj: * Updated to convert stuff which were missed by or added after the
original patch.
* Kill per_cpu_var() macro.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
This patch updates percpu related symbols in kernel tracer such that
percpu symbols are unique and don't clash with local symbols. This
serves two purposes of decreasing the possibility of global percpu
symbol collision and allowing dropping per_cpu__ prefix from percpu
symbols.
* kernel/trace/trace.c: s/max_data/max_tr_data/
* kernel/trace/trace_hw_branches: s/tracer/hwb_tracer/, s/buffer/hwb_buffer/
Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
which cause name clashes" patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
This patch updates percpu related symbols under kernel/ and mm/ such
that percpu symbols are unique and don't clash with local symbols.
This serves two purposes of decreasing the possibility of global
percpu symbol collision and allowing dropping per_cpu__ prefix from
percpu symbols.
* kernel/lockdep.c: s/lock_stats/cpu_lock_stats/
* kernel/sched.c: s/init_rq_rt/init_rt_rq_var/ (any better idea?)
s/sched_group_cpus/sched_groups/
* kernel/softirq.c: s/ksoftirqd/run_ksoftirqd/a
* kernel/softlockup.c: s/(*)_timestamp/softlockup_\1_ts/
s/watchdog_task/softlockup_watchdog/
s/timestamp/ts/ for local variables
* kernel/time/timer_stats: s/lookup_lock/tstats_lookup_lock/
* mm/slab.c: s/reap_work/slab_reap_work/
s/reap_node/slab_reap_node/
* mm/vmstat.c: local variable changed to avoid collision with vmstat_work
Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
which cause name clashes" patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: (slab/vmstat) Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Fix find_probe_event() to compare both of event-name and
event-group. Without this fix, kprobe-tracer overwrites existing
same event-name probe even if its group-name is different.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
LKML-Reference: <20091027204244.30545.27516.stgit@harusame>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We create a dummy struct kernel_param on the stack for parsing each
array element, but we didn't initialize the flags word. This matters
for arrays of type "bool", where the flag indicates if it really is
an array of bools or unsigned int (old-style).
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
kp->arg is always true: it's the contents of that pointer we care about.
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
e180a6b775 "param: fix charp parameters set via sysfs" fixed the case
where charp parameters written via sysfs were freed, leaving drivers
accessing random memory.
Unfortunately, storing a flag in the kparam struct was a bad idea: it's
rodata so setting it causes an oops on some archs. But that's not all:
1) module_param_array() on charp doesn't work reliably, since we use an
uninitialized temporary struct kernel_param.
2) there's a fundamental race if a module uses this parameter and then
it's changed: they will still access the old, freed, memory.
The simplest fix (ie. for 2.6.32) is to never free the memory. This
prevents all these problems, at cost of a memory leak. In practice, there
are only 18 places where a charp is writable via sysfs, and all are
root-only writable.
Reported-by: Takashi Iwai <tiwai@suse.de>
Cc: Sitsofe Wheeler <sitsofe@yahoo.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
Freezing isn't exactly the most latency sensitive operation and
there's no reason to burn cpu cycles and power waiting for it to
complete. msleep(10) instead of yield(). This should improve
reliability of emergency hibernation.
[rjw: Modified the comment next to the msleep(10).]
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr ==
NULL test) nor does it use the wake_list of futex_wake() which where
the reason for commit 41890f2 (futex: Handle spurious wake up)
See debugging discussing on LKML Message-ID: <4AD4080C.20703@us.ibm.com>
The changes in this fix to the wait_requeue_pi path were considered to
be a likely unecessary, but harmless safety net. But it turns out that
due to the fact that for unknown $@#!*( reasons EWOULDBLOCK is defined
as EAGAIN we built an endless loop in the code path which returns
correctly EWOULDBLOCK.
Spurious wakeups in wait_requeue_pi code path are unlikely so we do
the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let
it deal with the spurious wakeup.
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
LKML-Reference: <4AE23C74.1090502@us.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This section contains pointers to allocated objects and not scanning it
leads to false positives.
Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This function was taking non-necessary arguments which can be determined
by kmemleak. The patch also modifies the calling sites.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Add two more software events that are common to many cpus.
Alignment faults: When a load or store is not aligned properly.
Emulation faults: When an instruction is emulated in software.
Both cause a very significant slowdown (100x or worse), so identifying and
fixing them is very important.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch creates a synchronize_srcu_expedited() that uses
synchronize_sched_expedited() where synchronize_srcu()
uses synchronize_sched(). The synchronize_srcu() and
synchronize_srcu_expedited() functions become one-liners that
pass synchronize_sched() or synchronize_sched_expedited(),
repectively, to a new __synchronize_srcu() function.
While in the file, move the EXPORT_SYMBOL_GPL()s to immediately
follow the corresponding functions.
Requested-by: Avi Kivity <avi@redhat.com>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: avi@redhat.com
LKML-Reference: <12565226354038-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch is a version of RCU designed for !SMP provided for a
small-footprint RCU implementation. In particular, the
implementation of synchronize_rcu() is extremely lightweight and
high performance. It passes rcutorture testing in each of the
four relevant configurations (combinations of NO_HZ and PREEMPT)
on x86. This saves about 1K bytes compared to old Classic RCU
(which is no longer in mainline), and more than three kilobytes
compared to Hierarchical RCU (updated to 2.6.30):
CONFIG_TREE_RCU:
text data bss dec filename
183 4 0 187 kernel/rcupdate.o
2783 520 36 3339 kernel/rcutree.o
3526 Total (vs 4565 for v7)
CONFIG_TREE_PREEMPT_RCU:
text data bss dec filename
263 4 0 267 kernel/rcupdate.o
4594 776 52 5422 kernel/rcutree.o
5689 Total (6155 for v7)
CONFIG_TINY_RCU:
text data bss dec filename
96 4 0 100 kernel/rcupdate.o
734 24 0 758 kernel/rcutiny.o
858 Total (vs 848 for v7)
The above is for x86. Your mileage may vary on other platforms.
Further compression is possible, but is being procrastinated.
Changes from v7 (http://lkml.org/lkml/2009/10/9/388)
o Apply Lai Jiangshan's review comments (aside from
might_sleep() in synchronize_sched(), which is covered by SMP builds).
o Fix up expedited primitives.
Changes from v6 (http://lkml.org/lkml/2009/9/23/293).
o Forward ported to put it into the 2.6.33 stream.
o Added lockdep support.
o Make lightweight rcu_barrier.
Changes from v5 (http://lkml.org/lkml/2009/6/23/12).
o Ported to latest pre-2.6.32 merge window kernel.
- Renamed rcu_qsctr_inc() to rcu_sched_qs().
- Renamed rcu_bh_qsctr_inc() to rcu_bh_qs().
- Provided trivial rcu_cpu_notify().
- Provided trivial exit_rcu().
- Provided trivial rcu_needs_cpu().
- Fixed up the rcu_*_enter/exit() functions in linux/hardirq.h.
o Removed the dependence on EMBEDDED, with a view to making
TINY_RCU default for !SMP at some time in the future.
o Added (trivial) support for expedited grace periods.
Changes from v4 (http://lkml.org/lkml/2009/5/2/91) include:
o Squeeze the size down a bit further by removing the
->completed field from struct rcu_ctrlblk.
o This permits synchronize_rcu() to become the empty function.
Previous concerns about rcutorture were unfounded, as
rcutorture correctly handles a constant value from
rcu_batches_completed() and rcu_batches_completed_bh().
Changes from v3 (http://lkml.org/lkml/2009/3/29/221) include:
o Changed rcu_batches_completed(), rcu_batches_completed_bh()
rcu_enter_nohz(), rcu_exit_nohz(), rcu_nmi_enter(), and
rcu_nmi_exit(), to be static inlines, as suggested by David
Howells. Doing this saves about 100 bytes from rcutiny.o.
(The numbers between v3 and this v4 of the patch are not directly
comparable, since they are against different versions of Linux.)
Changes from v2 (http://lkml.org/lkml/2009/2/3/333) include:
o Fix whitespace issues.
o Change short-circuit "||" operator to instead be "+" in order
to fix performance bug noted by "kraai" on LWN.
(http://lwn.net/Articles/324348/)
Changes from v1 (http://lkml.org/lkml/2009/1/13/440) include:
o This version depends on EMBEDDED as well as !SMP, as suggested
by Ingo.
o Updated rcu_needs_cpu() to unconditionally return zero,
permitting the CPU to enter dynticks-idle mode at any time.
This works because callbacks can be invoked upon entry to
dynticks-idle mode.
o Paul is now OK with this being included, based on a poll at
the Kernel Miniconf at linux.conf.au, where about ten people said
that they cared about saving 900 bytes on single-CPU systems.
o Applies to both mainline and tip/core/rcu.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: avi@redhat.com
Cc: mtosatti@redhat.com
LKML-Reference: <12565226351355-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When CONFIG_USER_RETURN_NOTIFIER is set, we need to link
kernel/user-return-notifier.o.
Signed-off-by: Avi Kivity <avi@redhat.com>
LKML-Reference: <1256473485-23109-1-git-send-email-avi@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
CPU time of a guest is always accounted in 'user' time
without concern for the nice value of its counterpart
process although the guest is scheduled under the nice
value.
This patch fixes the defect and accounts cpu time of
a niced guest in 'nice' time as same as a niced process.
And also the patch adds 'guest_nice' to cpuacct. The
value provides niced guest cpu time which is like 'nice'
to 'user'.
The original discussions can be found here:
http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.htmlhttp://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html
Signed-off-by: Ryota Ozaki <ozaki.ryota@gmail.com>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1256314810-7897-1-git-send-email-ozaki.ryota@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The cpu argument is not used inside the rb_time_stamp() function.
Plus fix a typo.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091023233647.118547500@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Trivial patch to fix a documentation example and to fix a
comment.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091023233646.871719877@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
trace_seq_printf() return value is a little ambiguous. It
currently returns the length of the space available in the
buffer. printf usually returns the amount written. This is not
adequate here, because:
trace_seq_printf(s, "");
is perfectly legal, and returning 0 would indicate that it
failed.
We can always see the amount written by looking at the before
and after values of s->len. This is not quite the same use as
printf. We only care if the string was successfully written to
the buffer or not.
Make trace_seq_printf() return 0 if the trace oversizes the
buffer's free space, 1 otherwise.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091023233646.631787612@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of directly updating filp->f_pos we should update the *ppos
argument. The filp->f_pos gets updated within the file_pos_write()
function called from sys_write().
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091023233646.399670810@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch restores the effectiveness of LAST_BUDDY in preventing
pgsql+oltp from collapsing due to wakeup preemption. It also
switches LAST_BUDDY to exclusively do what it does best, namely
mitigate the effects of aggressive wakeup preemption, which
improves vmark throughput markedly, and restores mysql+oltp
scalability.
Since buddies are about scalability, enable them beginning at the
point where we begin expanding sched_latency, namely
sched_nr_latency. Previously, buddies were cleared aggressively,
which seriously reduced their effectiveness. Not clearing
aggressively however, produces a small drop in mysql+oltp
throughput immediately after peak, indicating that LAST_BUDDY is
actually doing some harm. This is right at the point where X on the
desktop in competition with another load wants low latency service.
Ergo, do not enable until we need to scale.
To mitigate latency induced by buddies, or by a task just missing
wakeup preemption, check latency at tick time.
Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via
CACHE_HOT_BUDDY.
Supporting performance tests:
tip = v2.6.32-rc5-1497-ga525b32
tipx = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies
tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs
(Three run averages except where noted.)
vmark:
------
tip 108466 messages per second
tip+ 125307 messages per second
tip+x 125335 messages per second
tipx 117781 messages per second
2.6.31.3 122729 messages per second
mysql+oltp:
-----------
clients 1 2 4 8 16 32 64 128 256
..........................................................................................
tip 9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08
tip+ 10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
tipx 9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18
2.6.31.3 8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86
pgsql+oltp:
-----------
clients 1 2 4 8 16 32 64 128 256
..........................................................................................
tip 13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35
tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
tip+x 13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00
tipx 13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84
2.6.31.3 13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Today I got:
[39648.224782] Registered led device: iwl-phy0::TX
[40676.545099] __ratelimit: 246 callbacks suppressed
[40676.545103] abcdef[23675]: segfault at 0 ...
as you can see the ratelimit message contains a function prefix.
Since this is always __ratelimit, this wont help much.
This patch changes __ratelimit and printk_ratelimit to print the
function name that calls ratelimit.
This will pinpoint the responsible function, as long as not several
different places call ratelimit with the same ratelimit state at
the same time. In that case we catch only one random function that
calls ratelimit after the wait period.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Young <hidave.darkstar@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <200910231458.11832.borntraeger@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Getting samples for the idle task is often not interesting, so
don't generate them when exclude_idle is set for the event in
question.
Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <ye8pr8fmlq7.fsf@camel16.daimi.au.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the hrtimer based events work for sysprof.
Whenever a swevent is scheduled out, the hrtimer is canceled.
When it is scheduled back in, the timer is restarted. This
happens every scheduler tick, which means the timer never
expired because it was getting repeatedly restarted over and
over with the same period.
To fix that, save the remaining time when disabling; when
reenabling, use that saved time as the period instead of the
user-specified sampling period.
Also, move the starting and stopping of the hrtimers to helper
functions instead of duplicating the code.
Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com>
LKML-Reference: <ye8vdi7mluz.fsf@camel16.daimi.au.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
tools/perf/Makefile
Merge reason:
- fix the conflict
- pick up the pr_*() infrastructure to queue up dependent patch
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Increase TEST_SUSPEND_SECONDS to 10 so the warning in
suspend_test_finish() doesn't annoy the users of slower systems so much.
Also, make the warning print the suspend-resume cycle time, so that we
know why the warning actually triggered.
Patch prepared during the hacking session at the Kernel Summit in Tokyo.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now when calling schedule_on_each_cpu() from keventd there
is a deadlock because it tries to schedule a work item on the current CPU
too. This happens via lru_add_drain_all() in hwpoison.
Just call the function for the current CPU in this case. This is actually
faster too.
Debugging with Fengguang Wu & Max Asbock
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Conflicts:
kernel/Makefile
kernel/trace/Makefile
kernel/trace/trace.h
samples/Makefile
Merge reason: We need to be uptodate with the perf events development
branch because we plan to rewrite the breakpoints API on top of
perf events.
When requeuing tasks from one futex to another, the reference held
by the requeued task to the original futex location needs to be
dropped eventually.
Dropping the reference may ultimately lead to a call to
"iput_final" and subsequently call into filesystem- specific code -
which may be non-atomic.
It is therefore safer to defer this drop operation until after the
futex_hash_bucket spinlock has been dropped.
Originally-From: Helge Bahmann <hcb@chaoticmind.net>
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: <stable@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
Cc: Sven-Thorsten Dietrich <sdietrich@novell.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <4AD7A298.5040802@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* branch 'tty-fixes'
tty: use the new 'flush_delayed_work()' helper to do ldisc flush
workqueue: add 'flush_delayed_work()' to run and wait for delayed work
tty: Make flush_to_ldisc() locking more robust
If the following sequence of events occurs, then
TREE_PREEMPT_RCU will hang waiting for a grace period to
complete, eventually OOMing the system:
o A TREE_PREEMPT_RCU build of the kernel is booted on a system
with more than 64 physical CPUs present (32 on a 32-bit system).
Alternatively, a TREE_PREEMPT_RCU build of the kernel is booted
with RCU_FANOUT set to a sufficiently small value that the
physical CPUs populate two or more leaf rcu_node structures.
o A task is preempted in an RCU read-side critical section
while running on a CPU corresponding to a given leaf rcu_node
structure.
o All CPUs corresponding to this same leaf rcu_node structure
record quiescent states for the current grace period.
o All of these same CPUs go offline (hence the need for enough
physical CPUs to populate more than one leaf rcu_node structure).
This causes the preempted task to be moved to the root rcu_node
structure.
At this point, there is nothing left to cause the quiescent
state to be propagated up the rcu_node tree, so the current
grace period never completes.
The simplest fix, especially after considering the deadlock
possibilities, is to detect this situation when the last CPU is
offlined, and to set that CPU's ->qsmask bit in its leaf
rcu_node structure. This will cause the next invocation of
force_quiescent_state() to end the grace period.
Without this fix, this hang can be triggered in an hour or so on
some machines with rcutorture and random CPU onlining/offlining.
With this fix, these same machines pass a full 10 hours of this
sort of abuse.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <20091015162614.GA19131@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Américo Wang noticed that we have a locking imbalance in the
error paths of ftrace_profile_set_filter(), causing potential
leakage of event_mutex.
Also clean up other error codepaths related to event_mutex
while at it.
Plus fix an initialized variable in the subsystem filter code.
Reported-by: Américo Wang <xiyou.wangcong@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <2375c9f90910150247u5ccb8e2at58c764e385ffa490@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- Add an ioctl to allocate a filter for a perf event.
- Free the filter when the associated perf event is to be freed.
- Do the filtering in perf_swevent_match().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <4AD69546.8050401@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
"==" will always do a full match, and "~" will do a glob match.
In the future, we may add "=~" for regex match.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <4AD69528.3050309@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change:
for_each_pred
for_each_subsystem
To:
for_each_subsystem
for_each_pred
This change also prepares for later patches.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <4AD69502.8060903@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: to add event filter support we need the following
commits from the tracing tree:
3f6fe06: tracing/filters: Unify the regex parsing helpers
1889d20: tracing/filters: Provide basic regex support
737f453: tracing/filters: Cleanup useless headers
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* branch 'tty-fixes':
tty: use the new 'flush_delayed_work()' helper to do ldisc flush
workqueue: add 'flush_delayed_work()' to run and wait for delayed work
Make flush_to_ldisc properly handle parallel calls
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
oprofile: warn on freeing event buffer too early
oprofile: fix race condition in event_buffer free
lockdep: Use cpu_clock() for lockstat
It basically turns a delayed work into an immediate work, and then waits
for it to finish, thus allowing you to force (and wait for) an immediate
flush of a delayed work.
We'll want to use this in the tty layer to clean up tty_flush_to_ldisc().
Acked-by: Oleg Nesterov <oleg@redhat.com>
[ Fixed to use 'del_timer_sync()' as noted by Oleg ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If userspace tries to perform a requeue_pi on a non-requeue_pi waiter,
it will find the futex_q->requeue_pi_key to be NULL and OOPS.
Check for NULL in match_futex() instead of doing explicit NULL pointer
checks on all call sites. While match_futex(NULL, NULL) returning
false is a little odd, it's still correct as we expect valid key
references.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dinakar Guniguntala <dino@in.ibm.com>
CC: John Stultz <johnstul@us.ibm.com>
Cc: stable@kernel.org
LKML-Reference: <4AD60687.10306@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
set_cmdline_ftrace is a better match against what does this function:
apply a tracer name from the kernel command line.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
We are using strncpy in the wrong way to copy the ftrace_graph_filter
boot param because we pass the buffer size instead of the max string
size it can contain (buffer size - 1). The end result might not be
NULL terminated as we are abusing the max string size.
Lets use strlcpy() instead.
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Serialization of sys_reboot can be done local. The BKL is not
protecting anything else.
LKML-Reference: <20091010153349.405590702@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
"name" is a poor name for a file-global variable. It was used in three
different functions, with no mutual exclusion. But it's just a tiny,
temporary string; let's just move it onto the stack in the functions that
need it. Also use snprintf() just in case.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
LKML-Reference: <20091010153349.113570550@linutronix.de>
Acked-by: Mark Gross <mgross@linux.intel.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
pm_qos_power_open got its lock_kernel() calls from the open() pushdown. A
look at the code shows that the only global resources accessed are
pm_qos_array and "name". pm_qos_array doesn't change (things pointed to
therein do change, but they are atomics and/or are protected by
pm_qos_lock). Accesses to "name" are totally unprotected with or without
the BKL; that will be fixed shortly. The BKL is not helpful here; take it
out.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
LKML-Reference: <20091010153349.071381158@linutronix.de>
Acked-by: Mark Gross <mgross@linux.intel.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Yanmin reported a hackbench regression due to:
> commit de69a80be3
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Thu Sep 17 09:01:20 2009 +0200
>
> sched: Stop buddies from hogging the system
I really liked de69a80b, and it affecting hackbench shows I wasn't
crazy ;-)
So hackbench is a multi-cast, with one sender spraying multiple
receivers, who in their turn don't spray back.
This would be exactly the scenario that patch 'cures'. Previously
we would not clear the last buddy after running the next task,
allowing the sender to get back to work sooner than it otherwise
ought to have been, increasing latencies for other tasks.
Now, since those receivers don't poke back, they don't enforce the
buddy relation, which means there's nothing to re-elect the sender.
Cure this by less agressively clearing the buddy stats. Only clear
buddies when they were not chosen. It should still avoid a buddy
sticking around long after its served its time.
Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Galbraith <efault@gmx.de>
LKML-Reference: <1255084986.8802.46.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Most of the syscalls metadata processing is done from arch.
But these operations are mostly generic accross archs. Especially now
that we have a common variable name that expresses the number of
syscalls supported by an arch: NR_syscalls, the only remaining bits
that need to reside in arch is the syscall nr to addr translation.
v2: Compare syscalls symbols only after the "sys" prefix so that we
avoid spurious mismatches with archs that have syscalls wrappers,
in which case syscalls symbols have "SyS" prefixed aliases.
(Reported by: Heiko Carstens)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
The loop in perf_ctx_adjust_freq checks the frequency of sampling
event counters, and adjusts the event interval and unthrottles the
event if required, and resets the interrupt count for the event.
However, at present it only looks at group leaders.
This means that a sampling event that is not a group leader will
eventually get throttled, once its interrupt count reaches
sysctl_perf_event_sample_rate/HZ --- and that is guaranteed to
happen, if the event is active for long enough, since the interrupt
count never gets reset. Once it is throttled it never gets
unthrottled, so it basically just stops working at that point.
This fixes it by making perf_ctx_adjust_freq use ctx->event_list
rather than ctx->group_list. The existing spin_lock/spin_unlock
around the loop makes it unnecessary to put rcu_read_lock/
rcu_read_unlock around the list_for_each_entry_rcu().
Reported-by: Mark W. Krentel <krentel@cs.rice.edu>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <19157.26731.855609.165622@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I was debuging some module using "function" and "function_graph"
tracers and noticed, that if you load module after you enabled
tracing, the module's hooks will convert only to NOP instructions.
The attached patch enables modules' hooks if there's function trace
allready on, thus allowing to trace module functions.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20091013203425.896285120@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The capabilities syscall has a copy_from_user() call where gcc currently
cannot prove to itself that the copy is always within bounds.
This patch adds a very explicity bound check to prove to gcc that this
copy_from_user cannot overflow its destination buffer.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Morris <jmorris@namei.org>
The futex code does not handle spurious wake up in futex_wait and
futex_wait_requeue_pi.
The code assumes that any wake up which was not caused by futex_wake /
requeue or by a timeout was caused by a signal wake up and returns one
of the syscall restart error codes.
In case of a spurious wake up the signal delivery code which deals
with the restart error codes is not invoked and we return that error
code to user space. That causes applications which actually check the
return codes to fail. Blaise reported that on preempt-rt a python test
program run into a exception trap. -rt exposed that due to a built in
spurious wake up accelerator :)
Solve this by checking signal_pending(current) in the wake up path and
handle the spurious wake up case w/o returning to user space.
Reported-by: Blaise Gassend <blaise@willowgarage.com>
Debugged-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
LKML-Reference: <new-submission>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: Add cciss_allow_hpsa module parameter
cciss: Fix multiple calls to pci_release_regions
blk-settings: fix function parameter kernel-doc notation
writeback: kill space in debugfs item name
writeback: account IO throttling wait as iowait
elv_iosched_store(): fix strstrip() misuse
cfq-iosched: avoid probable slice overrun when idling
cfq-iosched: apply bool value where we return 0/1
cfq-iosched: fix think time allowed for seekers
cfq-iosched: fix the slice residual sign
cfq-iosched: abstract out the 'may this cfqq dispatch' logic
block: use proper BLK_RW_ASYNC in blk_queue_start_tag()
block: Seperate read and write statistics of in_flight requests v2
block: get rid of kblock_schedule_delayed_work()
cfq-iosched: fix possible problem with jiffies wraparound
cfq-iosched: fix issue with rq-rq merging and fifo list ordering
ftrace_cpu_disabled usage in trace_functions_graph.c were left out
during this_cpu_xx conversion in commit 9288f99a causing compile
failure. Convert them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Remove the ftrace_trace_addr() function as only its off-case is
implemented and there are no users of it currently.
But we keep ftrace_graph_addr() off-case, in case someone come to use
the function graph tracer to profit from top-level callers filtering.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Do this rename because set_ftrace is too much generic and not enough
self-explainable as a name.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Meaning receive multiple messages, reducing the number of syscalls and
net stack entry/exit operations.
Next patches will introduce mechanisms where protocols that want to
optimize this operation will provide an unlocked_recvmsg operation.
This takes into account comments made by:
. Paul Moore: sock_recvmsg is called only for the first datagram,
sock_recvmsg_nosec is used for the rest.
. Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
works in the same fashion as the ppoll one.
If the underlying protocol returns a datagram with MSG_OOB set, this
will make recvmmsg return right away with as many datagrams (+ the OOB
one) it has received so far.
. Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
datagrams and then recvmsg returns an error, recvmmsg will return
the successfully received datagrams, store the error and return it
in the next call.
This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
where we will be able to acquire the lock only at batch start and end, not at
every underlying recvmsg call.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every time we set a filter, we leak memory allocated by
postfix_append_operand() and postfix_append_op().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: <stable@kernel.org> # for v2.6.31.x
LKML-Reference: <4AD3D7D9.4070400@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rename probe-common fixed field names to harder conflictable names,
because current 'ip', 'func', and other probe field names are easily in
conflict with user-specified variable names.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
LKML-Reference: <20091007222814.1684.407.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Check whether the argument name is in conflict with other field names
while creating a kprobe through the debugfs interface.
Changes in v3:
- Check strcmp() == 0 instead of !strcmp().
Changes in v2:
- Add common_lock_depth to reserved name list.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
LKML-Reference: <20091007222807.1684.26880.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Rename special variables to more self-explainable names as below:
- $rv to $retval
- $sa to $stack
- $aN to $argN
- $sN to $stackN
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
LKML-Reference: <20091007222759.1684.3319.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Add a command line parameter to allow limiting the function graphs
that are traced on boot up from the given top-level callers , when
ftrace=function_graph is specified.
This patch adds the following command line option:
ftrace_graph_filter=function-list
Where function-list is a comma separated list of functions to filter.
[fweisbec@gmail.com: picked the documentation changes from the v2 patch]
Signed-off-by: Stefan Assmann <sassmann@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4AD2DEB9.2@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Remove '$ra' (return address) because it is already shown at the head of
each entry.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
LKML-Reference: <20091007222748.1684.12711.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Add $ prefix to the special variables(e.g. sa, rv) of kprobe-tracer.
This resolves consistency issues between kprobe_events and perf-kprobe.
The main goal is to avoid conflicts between local variable names of
probed functions, used by perf probe, and special variables used
in the kprobe event creation interface (stack values, etc...) and
also available from perf probe.
ie: we don't want rv (return value) to conflict with a local variable
named rv in a probed function.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
LKML-Reference: <20091007222740.1684.91170.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
this_cpu_xx can reduce the instruction count here and also
avoid address arithmetic.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Tejun Heo <tj@kernel.org>
The following htmldocs warnings:
Warning(kernel/sched.c:685): No description found for parameter 'cpu'
Warning(kernel/sched.c:3676): No description found for parameter 'sd'
Trigger because new parameters were added to update_rq_clock() and
update_group_power() without updating the kernel-doc notation.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4AD29070.7070002@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
In try_to_wake_up(), we update the runqueue clock, but
select_task_rq() may select a different runqueue than the one we
updated, leaving the new runqueue's clock stale for a bit.
This patch cures occasional huge latencies reported by latencytop
when coming out of idle on a mostly idle NO_HZ box.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1255070103.7639.30.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some tracepoint magic (TRACE_EVENT(lock_acquired)) relies on
the fact that lock hold times are positive and uses div64 on
that. That triggered a build warning on MIPS, and probably
causes bad output in certain circumstances as well.
Make it truly positive.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1254818502.21044.112.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It makes sense to do IOWAIT when someone is blocked
due to IO throttle, as suggested by Kame and Peter.
There is an old comment for not doing IOWAIT on throttle,
however it has been mismatching the code for a long time.
If we stop accounting IOWAIT for 2.6.32, it could be an
undesirable behavior change. So restore the io_schedule.
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The addition of trace_array_{v}printk used the wrong function for
trace_vprintk to call. This broke trace_marker and trace_vprintk
itself. Although trace_printk may not have been affected by those
that end up calling trace_vbprintk.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futex: fix requeue_pi key imbalance
futex: Fix typo in FUTEX_WAIT/WAKE_BITSET_PRIVATE definitions
rcu: Place root rcu_node structure in separate lockdep class
rcu: Make hot-unplugged CPU relinquish its own RCU callbacks
rcu: Move rcu_barrier() to rcutree
futex: Move exit_pi_state() call to release_mm()
futex: Nullify robust lists after cleanup
futex: Fix locking imbalance
panic: Fix panic message visibility by calling bust_spinlocks(0) before dying
rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function
rcu: Clean up code based on review feedback from Josh Triplett, part 4
rcu: Clean up code based on review feedback from Josh Triplett, part 3
rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y
rcu: Clean up code to address Ingo's checkpatch feedback
rcu: Clean up code based on review feedback from Josh Triplett, part 2
rcu: Clean up code based on review feedback from Josh Triplett
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Set correct normal_prio and prio values in sched_fork()
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: user local buffer variable for trace branch tracer
tracing: fix warning on kernel/trace/trace_branch.c andtrace_hw_branches.c
ftrace: check for failure for all conversions
tracing: correct module boundaries for ftrace_release
tracing: fix transposed numbers of lock_depth and preempt_count
trace: Fix missing assignment in trace_ctxwake_*
tracing: Use free_percpu instead of kfree
tracing: Check total refcount before releasing bufs in profile_enable failure
* 'sparc-perf-events-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA
perf_event: Provide vmalloc() based mmap() backing
* 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_events: Make ABI definitions available to userspace
perf tools: elf_sym__is_function() should accept "zero" sized functions
tracing/syscalls: Use long for syscall ret format and field definitions
perf trace: Update eval_flag() flags array to match interrupt.h
perf trace: Remove unused code in builtin-trace.c
perf: Propagate term signal to child
Just using the tr->buffer for the API to trace_buffer_lock_reserve
is not good enough. This is because the tr->buffer may change, and we
do not want to commit with a different buffer that we reserved from.
This patch uses a local variable to hold the buffer that was used to
reserve and commit with.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
fix warnings that caused the API change of trace_buffer_lock_reserve()
change files: kernel/trace/trace_hw_branch.c
kernel/trace/trace_branch.c
Signed-off-by: Zhenwen Xu <helight.xu@gmail.com>
LKML-Reference: <20091008012146.GA4170@helight>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Due to legacy code from back when the dynamic tracer used a daemon,
only core kernel code was checking for failures. This is no longer
the case. We must check for failures any time we perform text modifications.
Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the module is about the unload we release its call records.
The ftrace_release function was given wrong values representing
the module core boundaries, thus not releasing its call records.
Plus making ftrace_release function module specific.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1254934835-363-3-git-send-email-jolsa@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If futex_wait_requeue_pi() wakes prior to requeue, we drop the
reference to the source futex_key twice, once in
handle_early_requeue_pi_wakeup() and once on our way out.
Remove the drop from the handle_early_requeue_pi_wakeup() and keep
the get/drops together in futex_wait_requeue_pi().
Reported-by: Helge Bahmann <hcb@chaoticmind.net>
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Helge Bahmann <hcb@chaoticmind.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: stable-2.6.31 <stable@kernel.org>
LKML-Reference: <4ACCE21E.5030805@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit f2e21c9610 had unfortunate side
effects with cpufreq governors on some systems.
If the system did not switch into NOHZ mode ts->inidle is not set when
tick_nohz_stop_sched_tick() is called from the idle routine. Therefor
all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick()
fail to call tick_nohz_start_idle(). This results in bogus idle
accounting information which is passed to cpufreq governors.
Set the inidle flag unconditionally of the NOHZ active state to keep
the idle time accounting correct in any case.
[ tglx: Added comment and tweaked the changelog ]
Reported-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Greg KH <greg@kroah.com>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: stable@kernel.org
LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Before this patch, all of the rcu_node structures were in the same lockdep
class, so that lockdep would complain when rcu_preempt_offline_tasks()
acquired the root rcu_node structure's lock while holding one of the leaf
rcu_nodes' locks.
This patch changes rcu_init_one() to use a separate
spin_lock_init() for the root rcu_node structure's lock than is
used for that of all of the rest of the rcu_node structures, which
puts the root rcu_node structure's lock in its own lockdep class.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12548908983277-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current interaction between RCU and CPU hotplug requires that
RCU block in CPU notifiers waiting for callbacks to drain.
This can be greatly simplified by having each CPU relinquish its
own callbacks, and for both _rcu_barrier() and CPU_DEAD notifiers
to adopt all callbacks that were previously relinquished.
This change also eliminates the possibility of certain types of
hangs due to the previous practice of waiting for callbacks to be
invoked from within CPU notifiers. If you don't every wait, you
cannot hang.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1254890898456-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
exit_pi_state() is called from do_exit() but not from do_execve().
Move it to release_mm() so it gets called from do_execve() as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: stable@kernel.org
Cc: Anirban Sinha <ani@anirban.org>
Cc: Peter Zijlstra <peterz@infradead.org>
The robust list pointers of user space held futexes are kept intact
over an exec() call. When the exec'ed task exits exit_robust_list() is
called with the stale pointer. The risk of corruption is minimal, but
still it is incorrect to keep the pointers valid. Actually glibc
should uninstall the robust list before calling exec() but we have to
deal with it anyway.
Nullify the pointers after [compat_]exit_robust_list() has been
called.
Reported-by: Anirban Sinha <ani@anirban.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: stable@kernel.org
The sign info used for filters in the kernel is also useful to
applications that process the trace stream. Add it to the format
files and make it available to userspace.
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: lizf@cn.fujitsu.com
Cc: hch@infradead.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1254809398-8078-2-git-send-email-tzanussi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The state char variable S should be reassigned, if S == 0.
We are missing the state of the task that is going to sleep for the
context switch events (in the raw mode).
Fortunately the problem arises with the sched_switch/wake_up
tracers, not the sched trace events.
The formers are legacy now. But still, that was buggy.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4AC43118.6050409@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some architectures such as Sparc, ARM and MIPS (basically
everything with flush_dcache_page()) need to deal with dcache
aliases by carefully placing pages in both kernel and user maps.
These architectures typically have to use vmalloc_user() for this.
However, on other architectures, vmalloc() is not needed and has
the downsides of being more restricted and slower than regular
allocations.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: David Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1254830228.21044.272.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The syscall event definitions use long for the syscall exit ret
value, but unsigned long for the same thing in the format and field
definitions. Change them all to long.
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: lizf@cn.fujitsu.com
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1254808849-7829-4-git-send-email-tzanussi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the comment about calling alloc_bootmem() as it is not
called here since commit 36b7b6d465.
Signed-off-by: Jayson R. King <dev@jaysonking.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <trivial@kernel.org>
LKML-Reference: <4AC9C8A6.6010209@jaysonking.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rich reported a lock imbalance in the futex code:
http://bugzilla.kernel.org/show_bug.cgi?id=14288
It's caused by the displacement of the retry_private label in
futex_wake_op(). The code unlocks the hash bucket locks in the
error handling path and retries without locking them again which
makes the next unlock fail.
Move retry_private so we lock the hash bucket locks when we retry.
Reported-by: Rich Ercolany <rercola@acm.jhu.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: stable-2.6.31 <stable@kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit ffd71da4e3 ("panic: decrease oops_in_progress only after
having done the panic") moved bust_spinlocks(0) to the end of the
function, which in practice is never reached.
As a result console_unblank() is not called, and on some systems
the user may not see the panic message.
Move it back up to before the unblanking.
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1254483680-25578-1-git-send-email-aaro.koskinen@nokia.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf tools: Run generate-cmdlist.sh properly
perf_event: Clean up perf_event_init_task()
perf_event: Fix event group handling in __perf_event_sched_*()
perf timechart: Add a power-only mode
perf top: Add poll_idle to the skip list
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimer: Remove overly verbose "switch to high res mode" message
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
kmemtrace: Fix up tracer registration
tracing: Fix infinite recursion in ftrace_update_pid_func()
The rcu_barrier enum causes several problems:
(1) you have to define the enum somewhere, and there is no
convenient place,
(2) the difference between TREE_RCU and TREE_PREEMPT_RCU causes
problems when you need to map from rcu_barrier enum to struct
rcu_state,
(3) the switch statement are large, and
(4) TINY_RCU really needs a different rcu_barrier() than do the
treercu implementations.
So replace it with a functionally equivalent but cleaner function
pointer abstraction.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12541998232366-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These issues identified during an old-fashioned face-to-face code
review extending over many hours. This group improves an existing
abstraction and introduces two new ones. It also fixes an RCU
stall-warning bug found while making the other changes.
o Make RCU_INIT_FLAVOR() declare its own variables, removing
the need to declare them at each call site.
o Create an rcu_for_each_leaf() macro that scans the leaf
nodes of the rcu_node tree.
o Create an rcu_for_each_node_breadth_first() macro that does
a breadth-first traversal of the rcu_node tree, AKA
stepping through the array in index-number order.
o If all CPUs corresponding to a given leaf rcu_node
structure go offline, then any tasks queued on that leaf
will be moved to the root rcu_node structure. Therefore,
the stall-warning code must dump out tasks queued on the
root rcu_node structure as well as those queued on the leaf
rcu_node structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12541491934126-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Whitespace fixes, updated comments, and trivial code movement.
o Fix whitespace error in RCU_HEAD_INIT()
o Move "So where is rcu_write_lock()" comment so that it does
not come between the rcu_read_unlock() header comment and
the rcu_read_unlock() definition.
o Move the module_param statements for blimit, qhimark, and
qlowmark to immediately follow the corresponding
definitions.
o In __rcu_offline_cpu(), move the assignment to rdp_me
inside the "if" statement, given that rdp_me is not used
outside of that "if" statement.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12541491931164-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the rcu_lock_map definition from rcutree.c to rcupdate.c so that
TINY_RCU can use lockdep.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the prior logarithmic time accumulation patch, xtime will now
always be within one "tick" of the current time, instead of
possibly half a second off.
This removes the need for the xtime_cache value, which always
stored the time at the last interrupt, so this patch cleans that up
removing the xtime_cache related code.
This is a bit simpler, but still could use some wider testing.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1254525855.7741.95.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Accumulating one tick at a time works well unless we're using NOHZ.
Then it can be an issue, since we may have to run through the loop
a few thousand times, which can increase timer interrupt caused
latency.
The current solution was to accumulate in half-second intervals
with NOHZ. This kept the number of loops down, however it did
slightly change how we make NTP adjustments. While not an issue
with NTPd users, as NTPd makes adjustments over a longer period of
time, other adjtimex() users have noticed the half-second
granularity with which we can apply frequency changes to the clock.
For instance, if a application tries to apply a 100ppm frequency
correction for 20ms to correct a 2us offset, with NOHZ they either
get no correction, or a 50us correction.
Now, there will always be some granularity error for applying
frequency corrections. However with users sensitive to this error
have seen a 50-500x increase with NOHZ compared to running without
NOHZ.
So I figured I'd try another approach then just simply increasing
the interval. My approach is to consume the time interval
logarithmically. This reduces the number of times through the loop
needed keeping latency down, while still preserving the original
granularity error for adjtimex() changes.
Further, this change allows us to remove the xtime_cache code
(patch to follow), as xtime is always within one tick of the
current time, instead of the half-second updates it saw before.
An earlier version of this patch has been shipping to x86 users in
the RedHat MRG releases for awhile without issue, but I've reworked
this version to be even more careful about avoiding possible
overflows if the shift value gets too large.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1254525473.7741.88.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
normal_prio should be updated if policy changes from RT to
SCHED_MORMAL or if static_prio/nice is changed.
Some paths through sched_fork() ignore this requirement and may
result in normal_prio having an invalid value.
Fixing this issue allows the call to effective_prio() in
wake_up_new_task() to be removed.
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <f8f46736fd4e7f090ac0.1253774830@mudlark.pw.nest>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the event->profile_enable() failure path, we release the per cpu
buffers using kfree which is wrong because they are per cpu pointers.
Although free_percpu only wraps kfree for now, that may change in the
future so lets use the correct way.
Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
When we call the profile_enable() callback of an event, we release the
shared perf event tracing buffers unconditionnaly in the failure path.
This is wrong because there may be other users of these. Then check the
total refcount before doing this.
Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits)
Revert "Seperate read and write statistics of in_flight requests"
cfq-iosched: don't delay async queue if it hasn't dispatched at all
block: Topology ioctls
cfq-iosched: use assigned slice sync value, not default
cfq-iosched: rename 'desktop' sysfs entry to 'low_latency'
cfq-iosched: implement slower async initiate and queue ramp up
cfq-iosched: delay async IO dispatch, if sync IO was just done
cfq-iosched: add a knob for desktop interactiveness
Add a tracepoint for block request remapping
block: allow large discard requests
block: use normal I/O path for discard requests
swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL
fs/bio.c: move EXPORT* macros to line after function
Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs
cciss: fix build when !PROC_FS
block: Do not clamp max_hw_sectors for stacking devices
block: Set max_sectors correctly for stacking devices
cciss: cciss_host_attr_groups should be const
cciss: Dynamically allocate the drive_info_struct for each logical drive.
cciss: Add usage_count attribute to each logical drive in /sys
...
While writing the manpage I noticed some shortcomings in the
current interface.
- Define symbolic names for all the different values
- Boundary check the kill mode values
- For symmetry add a get interface too. This allows library
code to get/set the current state.
- For consistency define a PR_MCE_KILL_DEFAULT value
Signed-off-by: Andi Kleen <ak@linux.intel.com>
RCU does not do dynamic allocations but it increments per cpu variables
a lot. These instructions results in a move to a register and then back
to memory. This patch will make it use the inc/dec instructions on x86
that do not need a register.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Check result of event_create_dir() and add ftrace_event_call to
ftrace_events list only if it is succeeded. Thanks to Li for pointing
it out.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <20090925182054.10157.55219.stgit@omoto>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Use new percpu global event buffer instead of stack in kprobe
tracer while tracing through perf.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <20090925182011.10157.60140.stgit@omoto>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
With ia64 converted, there's no arch left which still uses legacy
percpu allocator. Kill it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Delightedly-acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
This patch clean up/fixes for memcg's uncharge soft limit path.
Problems:
Now, res_counter_charge()/uncharge() handles softlimit information at
charge/uncharge and softlimit-check is done when event counter per memcg
goes over limit. Now, event counter per memcg is updated only when
memory usage is over soft limit. Here, considering hierarchical memcg
management, ancesotors should be taken care of.
Now, ancerstors(hierarchy) are handled in charge() but not in uncharge().
This is not good.
Prolems:
1. memcg's event counter incremented only when softlimit hits. That's bad.
It makes event counter hard to be reused for other purpose.
2. At uncharge, only the lowest level rescounter is handled. This is bug.
Because ancesotor's event counter is not incremented, children should
take care of them.
3. res_counter_uncharge()'s 3rd argument is NULL in most case.
ops under res_counter->lock should be small. No "if" sentense is better.
Fixes:
* Removed soft_limit_xx poitner and checks in charge and uncharge.
Do-check-only-when-necessary scheme works enough well without them.
* make event-counter of memcg incremented at every charge/uncharge.
(per-cpu area will be accessed soon anyway)
* All ancestors are checked at soft-limit-check. This is necessary because
ancesotor's event counter may never be modified. Then, they should be
checked at the same time.
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__css_put() doesn't check a bug as refcnt goes to minus.
I think it should be caught. This patch adds a check for it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Starting from commit 4a4962263f "reduce
symbol table for loaded modules (v2)", the kernel/module.c build is broken
with CONFIG_KALLSYMS disabled.
CC kernel/module.o
kernel/module.c:1995: warning: type defaults to 'int' in declaration of 'Elf_Hdr'
kernel/module.c:1995: error: expected ';', ',' or ')' before '*' token
kernel/module.c: In function 'load_module':
kernel/module.c:2203: error: 'strmap' undeclared (first use in this function)
kernel/module.c:2203: error: (Each undeclared identifier is reported only once
kernel/module.c:2203: error: for each function it appears in.)
kernel/module.c:2239: error: 'symoffs' undeclared (first use in this function)
kernel/module.c:2239: error: implicit declaration of function 'layout_symtab'
kernel/module.c:2240: error: 'stroffs' undeclared (first use in this function)
make[1]: *** [kernel/module.o] Error 1
make: *** [kernel/module.o] Error 2
There are three different issues:
- layout_symtab() takes a const Elf_Ehdr
- layout_symtab() needs to return a value
- symoffs/stroffs/strmap are referenced by the load_module() code
despite being ifdefed out, which seems unnecessary given the noop
behaviour of layout_symtab()/add_kallsyms() in the case of
CONFIG_KALLSYMS=n.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 2.6.31 now has request-based device-mapper, it's useful to have
a tracepoint for request-remapping as well as bio-remapping.
This patch adds a tracepoint for request-remapping, trace_block_rq_remap().
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add a general per-cpu notifier that is called whenever the kernel is
about to return to userspace. The notifier uses a thread_info flag
and existing checks, so there is no impact on user return or context
switch fast paths.
This will be used initially to speed up KVM task switching by lazily
updating MSRs.
Signed-off-by: Avi Kivity <avi@redhat.com>
LKML-Reference: <1253342422-13811-1-git-send-email-avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Commit ddc1637af2 ("kmemtrace: Print
binary output only if 'bin' option is set") ended up inverting the
error detection logic. register_tracer() returns 0 on success,
which this change caused to treat as an error, resulting in:
[ 0.132000] Warning: could not register the kmem tracer
as well as bailing out of the initcall with an error value. This
restores the old logic.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <20090928075540.GD6668@linux-sh.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While at it: we can traverse ctx->group_list to get all
group leader, it should be safe since we hold ctx->mutex.
Changlog v1->v2:
- remove WARN_ON_ONCE() according to Peter Zijlstra's suggestion
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4ABC5AF9.6060808@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Mackerras says:
"Actually, looking at this more closely, it has to be a group
leader anyway since it's at the top level of ctx->group_list. In
fact I see four places where we do:
list_for_each_entry(event, &ctx->group_list, group_entry) {
if (event == event->group_leader)
...
or the equivalent, three of which appear to have been introduced
by afedadf2 ("perf_counter: Optimize sched in/out of counters")
back in May by Peter Z.
As far as I can see the if () is superfluous in each case (a
singleton event will be a group of 1 and will have its
group_leader pointing to itself)."
[ See: http://marc.info/?l=linux-kernel&m=125361238901442&w=2 ]
And Peter Zijlstra points out this is a bugfix:
"The intent was to call event_sched_{in,out}() for single event
groups because that's cheaper than group_sched_{in,out}(),
however..
- as you noticed, I got the condition wrong, it should have read:
list_empty(&event->sibling_list)
- it failed to call group_can_go_on() which deals with ->exclusive.
- it also doesn't call hw_perf_group_sched_in() which might break
power."
[ See: http://marc.info/?l=linux-kernel&m=125369523318583&w=2 ]
Changelog v1->v2:
- Fix the title name according to Peter Zijlstra's suggestion
- Remove the comments and WARN_ON_ONCE() as Peter Zijlstra's
suggestion
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4ABC5A55.7000208@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST is enabled
__ftrace_trace_function contains the current trace function, not
ftrace_trace_function.
In ftrace_update_pid_func() we currently incorrectly assign the
value of ftrace_trace_function to __ftrace_trace_funcion before
returning.
Without this patch it is possible to execute an infinite recursion
whereby ftrace_test_stop_func() calls __ftrace_trace_function,
which was assigned ftrace_test_stop_func() in
ftrace_update_pid_func().
Signed-off-by: Matt Fleming <matthew.fleming@imgtec.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1254152581-18347-1-git-send-email-matt@console-pimps.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit def0a9b257 (sched_clock: Make it NMI safe) assumed
cmpxchg() of 64bit values was available on X86_32.
That is not so - and causes some subtle scheduler misbehavior due
to incorrect timestamps off to up by ~4 seconds.
Two symptoms are known right now:
- interactivity problems seen by Arjan: up to 600 msecs
latencies instead of the expected 20-40 msecs. These
latencies are very visible on the desktop.
- incorrect CPU stats: occasionally too high percentages in 'top',
and crazy CPU usage stats.
Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090930170754.0886ff2e@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clocksource: Resume clocksource without taking the clocksource mutex
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
modules, tracing: Remove stale struct marker signature from module_layout()
tracing/workqueue: Use %pf in workqueue trace events
tracing: Fix a comment and a trivial format issue in tracepoint.h
tracing: Fix failure path in ftrace_regex_open()
tracing: Fix failure path in ftrace_graph_write()
tracing: Check the return value of trace_get_user()
tracing: Fix off-by-one in trace_get_user()
On big systems, printing <number of CPUs> copies of
Switched to high resolution mode on CPU nnn
clutters up the kernel log for minimal gain. Just get rid of them.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
LKML-Reference: <ada1vlw126s.fsf_-_@cisco.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
git commit 75c5158f70 converted the clocksource spinlock to a
mutex. This causes the following BUG:
BUG: sleeping function called from invalid context at
kernel/mutex.c:280 in_atomic(): 0, irqs_disabled(): 1, pid: 2473,
name: pm-suspend 2 locks held by pm-suspend/2473:
#0: (&buffer->mutex){......}, at: [<ffffffff8115ab13>]
sysfs_write_file+0x3c/0x137
#1: (pm_mutex){......}, at: [<ffffffff810865b5>]
enter_state+0x39/0x130 Pid: 2473, comm: pm-suspend Not tainted 2.6.31
#1 Call Trace:
[<ffffffff810792f0>] ? __debug_show_held_locks+0x22/0x24
[<ffffffff8104a2ef>] __might_sleep+0x107/0x10b
[<ffffffff8141fca9>] mutex_lock_nested+0x25/0x43
[<ffffffff81073537>] clocksource_resume+0x1c/0x60
[<ffffffff81072902>] timekeeping_resume+0x1e/0x1c8
[<ffffffff812aee62>] __sysdev_resume+0x25/0xcf
[<ffffffff812aef79>] sysdev_resume+0x6d/0xae
[<ffffffff810864f8>] suspend_devices_and_enter+0x12b/0x1af
[<ffffffff8108665b>] enter_state+0xdf/0x130
[<ffffffff81085dc3>] state_store+0xb6/0xd3
[<ffffffff81204c73>] kobj_attr_store+0x17/0x19
[<ffffffff8115abd2>] sysfs_write_file+0xfb/0x137
[<ffffffff811057d2>] vfs_write+0xae/0x10b
[<ffffffff81208392>] ? __up_read+0x1a/0x7f
[<ffffffff811058ef>] sys_write+0x4a/0x6e
[<ffffffff81011b82>] system_call_fastpath+0x16/0x1b
clocksource_resume is called early in the resume process, there is
only one cpu, no processes are running and the interrupts are
disabled. It is therefore possible to resume the clocksources
without taking the clocksource mutex.
Reported-by: Xiaotian Feng <xtfeng@gmail.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Tested-by: Michal Schmidt <mschmidt@redhat.com>
Cc: Xiaotian Feng <xtfeng@gmail.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090924172952.49697825@mschwide.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The memory barrier semantics of futex_wait_queue_me() are
non-obvious. Add some commentary to try and clarify it.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090924185447.694.38948.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The filter code has stolen the regex parsing function from ftrace to
get the regex support.
We have duplicated this code, so factorize it in the filter area and
make it generally available, as the filter code is the most suited to
host this feature.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
This patch provides basic support for regular expressions in filters.
It supports the following types of regexp:
- *match_beginning
- *match_middle*
- match_end*
- !don't match
Example:
cd /debug/tracing/events/bkl/lock_kernel
echo 'file == "*reiserfs*"' > filter
echo 1 > enable
gedit-4941 [000] 457.735437: lock_kernel: depth: 0, fs/reiserfs/namei.c:334 reiserfs_lookup()
sync_supers-227 [001] 461.379985: lock_kernel: depth: 0, fs/reiserfs/super.c:69 reiserfs_sync_fs()
sync_supers-227 [000] 461.383096: lock_kernel: depth: 0, fs/reiserfs/journal.c:1069 flush_commit_list()
reiserfs/1-1369 [001] 461.479885: lock_kernel: depth: 0, fs/reiserfs/journal.c:3509 flush_async_commits()
Every string is now handled as a regexp in the filter framework, which
helps to factorize the code for handling both simple strings and
regexp comparisons.
(The regexp parsing code has been wildly cherry picked from ftrace.c
written by Steve.)
v2: Simplify the whole and drop the filter_regex file
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
* 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze: (24 commits)
microblaze: Disable heartbeat/enable emaclite in defconfigs
microblaze: Support simpleImage.dts make target
microblaze: Fix _start symbol to physical address
microblaze: Use LOAD_OFFSET macro to get correct LMA for all sections
microblaze: Create the LOAD_OFFSET macro used to compute VMA vs LMA offsets
microblaze: Copy ppc asm-compat.h for clean handling of constants in asm and C
microblaze: Actually show KiB rather than pages in "Freeing initrd memory:"
microblaze: Support ptrace syscall tracing.
microblaze: Updated CPU version and FPGA family codes in PVR
microblaze: Generate correct signal and siginfo for integer div-by-zero
microblaze: Don't be noisy when userspace causes hardware exceptions
microblaze: Remove ipc.h file which points to non-existing asm-generic file
microblaze: Clear sticky FSR register after generating exception signals
microblaze: Ensure CPU usermode is set on new userspace processes
microblaze: Use correct kbuild variable KBUILD_CFLAGS
microblaze: Save and restore msr in hw exception
microblaze: Add architectural support for USB EHCI host controllers
microblaze: Implement include/asm/syscall.h.
microblaze: Improve checking mechanism for MSR instruction
microblaze: Add checking mechanism for MSR instruction
...
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
module: don't call percpu_modfree on NULL pointer.
module: fix memory leak when load fails after srcversion/version allocated
module: preferred way to use MODULE_AUTHOR
param: allow whitespace as kernel parameter separator
module: reduce string table for loaded modules (v2)
module: reduce symbol table for loaded modules (v2)
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
lsm: Use a compressed IPv6 string format in audit events
Audit: send signal info if selinux is disabled
Audit: rearrange audit_context to save 16 bytes per struct
Audit: reorganize struct audit_watch to save 8 bytes
The general one handles NULL, the static obsolescent
(CONFIG_HAVE_LEGACY_PER_CPU_AREA) one in module.c doesn't; Eric's
commit 720eba31 assumed it did, and various frobbings since then kept
that assumption.
All other callers in module.c all protect it with an if; this effectively
does the same as free_init is only goto if we fail percpu_modalloc().
Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Normally the twisty paths of sysfs will free the attributes, but not if
we fail before we hook it into sysfs (which is the last thing we do in
load_module).
(This sysfs code is a turd, no doubt there are other issues lurking too).
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Some boot mechanisms require that kernel parameters are stored in a
separate file which is loaded to memory without further processing
(e.g. the "Load from FTP" method on s390). When such a file contains
newline characters, the kernel parameter preceding the newline might
not be correctly parsed (due to the newline being stuck to the end of
the actual parameter value) which can lead to boot failures.
This patch improves kernel command line usability in such a situation
by allowing generic whitespace characters as separators between kernel
parameters.
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Also remove all parts of the string table (referenced by the symbol
table) that are not needed for kallsyms use (i.e. which were only
referenced by symbols discarded by the previous patch, or not
referenced at all for whatever reason).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Discard all symbols not interesting for kallsyms use: absolute,
section, and in the common case (!KALLSYMS_ALL) data ones.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
Because the binfmt is not different between threads in the same process,
it can be moved from task_struct to mm_struct. And binfmt moudle is
handled per mm_struct instead of task_struct.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CLONE_PARENT was used to implement an older threading model. For
consistency with the CLONE_THREAD check in copy_pid_ns(), disable
CLONE_PARENT with CLONE_NEWPID, at least until the required semantics of
pid namespaces are clear.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When global or container-init processes use CLONE_PARENT, they create a
multi-rooted process tree. Besides siblings of global init remain as
zombies on exit since they are not reaped by their parent (swapper). So
prevent global and container-inits from creating siblings.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's unused.
It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.
It _was_ used in two places at arch/frv for some reason.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__fatal_signal_pending inlines to one instruction on x86, probably two
instructions on other machines. It takes two longer x86 instructions just
to call it and test its return value, not to mention the function itself.
On my random x86_64 config, this saved 70 bytes of text (59 of those being
__fatal_signal_pending itself).
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce do_send_sig_info() and convert group_send_sig_info(),
send_sig_info(), do_send_specific() to use this helper.
Hopefully it will have more users soon, it allows to specify
specific/group behaviour via "bool group" argument.
Shaves 80 bytes from .text.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce core pipe limiting sysctl.
Since we can dump cores to pipe, rather than directly to the filesystem,
we create a condition in which a user can create a very high load on the
system simply by running bad applications.
If the pipe reader specified in core_pattern is poorly written, we can
have lots of ourstandig resources and processes in the system.
This sysctl introduces an ability to limit that resource consumption.
core_pipe_limit defines how many in-flight dumps may be run in parallel,
dumps beyond this value are skipped and a note is made in the kernel log.
A special value of 0 in core_pipe_limit denotes unlimited core dumps may
be handled (this is the default value).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes tracehook_notify_jctl() so it's called with the siglock held,
and changes its argument and return value definition. These clean-ups
make it a better fit for what new tracing hooks need to check.
Tracing needs the siglock here, held from the time TASK_STOPPED was set,
to avoid potential SIGCONT races if it wants to allow any blocking in its
tracing hooks.
This also folds the finish_stop() function into its caller
do_signal_stop(). The function is short, called only once and only
unconditionally. It aids readability to fold it in.
[oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
[oleg@redhat.com: introduce tracehook_finish_jctl() helper]
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current behaviour of sys_waitid() looks odd. If user passes infop ==
NULL, sys_waitid() returns success. When user additionally specifies flag
WNOWAIT, sys_waitid() returns -EFAULT on the same conditions. When user
combines WNOWAIT with WCONTINUED, sys_waitid() again returns success.
This patch adds check for ->wo_info in wait_noreap_copyout().
User-visible change: starting from this commit, sys_waitid() always checks
infop != NULL and does not fail if it is NULL.
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_wait() checks ->wo_info to figure out who is the caller. If it's not
NULL the caller should be sys_waitid(), in that case do_wait() fixes up
the retval or zeros ->wo_info, depending on retval from underlying
function.
This is bug: user can pass ->wo_info == NULL and sys_waitid() will return
incorrect value.
man 2 waitid says:
waitid(): returns 0 on success
Test-case:
int main(void)
{
if (fork())
assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);
return 0;
}
Result:
Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.
Move that code to sys_waitid().
User-visible change: sys_waitid() will return 0 on success, either
infop is set or not.
Note, there's another bug in wait_noreap_copyout() which affects
return value of sys_waitid(). It will be fixed in next patch.
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_pid_type() is only used by eligible_pid() which has to check wo_type
!= PIDTYPE_MAX anyway. Remove this check from task_pid_type() and factor
out ->pids[type] access, this shrinks .text a bit and simplifies the code.
The matches the behaviour of other similar helpers, say get_task_pid().
The caller must ensure that pid_type is valid, not the callee.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
child_wait_callback()->eligible_child() is not right, we can miss the
wakeup if the task was detached before __wake_up_parent() and the caller
of do_wait() didn't use __WALL.
Move ->wo_pid checks from eligible_child() to the new helper,
eligible_pid(), and change child_wait_callback() to use it instead of
eligible_child().
Note: actually I think it would be better to fix the __WCLONE check in
eligible_child(), it doesn't look exactly right. But it is not clear what
is the supposed behaviour, and any change is user-visible.
Reported-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested by Roland.
do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or
it is ->real_parent and the child is not traced. IOW, caller == p->parent
otherwise we should not wake up.
Change child_wait_callback() to check this. Ratan reports the workload with
CPU load >99% caused by unnecessary wakeups, should be fixed by this patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ratan Nalumasu reported that in a process with many threads doing
unnecessary wakeups. Every waiting thread in the process wakes up to loop
through the children and see that the only ones it cares about are still
not ready.
Now that we have struct wait_opts we can change do_wait/__wake_up_parent
to use filtered wakeups.
We can make child_wait_callback() more clever later, right now it only
checks eligible_child().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Preparation, no functional changes.
eligible_child() has a single caller, wait_consider_task(). We can move
security_task_wait() out from eligible_child(), this allows us to use it
for filtered wake_up().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bug is old, it wasn't cause by recent changes.
Test case:
static void *tfunc(void *arg)
{
int pid = (long)arg;
assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
kill(pid, SIGKILL);
sleep(1);
return NULL;
}
int main(void)
{
pthread_t th;
long pid = fork();
if (!pid)
pause();
signal(SIGCHLD, SIG_IGN);
assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);
int r = waitpid(-1, NULL, __WNOTHREAD);
printf("waitpid: %d %m\n", r);
return 0;
}
Before the patch this program hangs, after this patch waitpid() correctly
fails with errno == -ECHILD.
The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
we should wake up other threads which can sleep in do_wait().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Organize cgroups over soft limit in a RB-Tree
Introduce an RB-Tree for storing memory cgroups that are over their soft
limit. The overall goal is to
1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
We are careful about updates, updates take place only after a particular
time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
limit
The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.
[hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an interface to allow get/set of soft limits. Soft limits for memory
plus swap controller (memsw) is currently not supported. Resource
counters have been enhanced to support soft limits and new type
RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be
directly set and do not need any reclaim or checks before setting them to
a newer value.
Kamezawa-San raised a question as to whether soft limit should belong to
res_counter. Since all resources understand the basic concepts of hard
and soft limits, it is justified to add soft limits here. Soft limits are
a generic resource usage feature, even file system quotas support soft
limits.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alter the ss->can_attach and ss->attach functions to be able to deal with
a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
pre-patch to cgroup-procs-writable.patch.)
Currently, new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread that's being moved, but if one were
to be added (for example, one that counts tasks within a group) this bit
would need to be reworked a bit to tell the subsystem the right
information.
[hidave.darkstar@gmail.com: fix build]
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changes css_set freeing mechanism to be under RCU
This is a prepatch for making the procs file writable. In order to free the
old css_sets for each task to be moved as they're being moved, the freeing
mechanism must be RCU-protected, or else we would have to have a call to
synchronize_rcu() for each task before freeing its old css_set.
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Separates all pidlist allocation requests to a separate function that
judges based on the requested size whether or not the array needs to be
vmalloced or can be gotten via kmalloc, and similar for kfree/vfree.
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously there was the problem in which two processes from different pid
namespaces reading the tasks or procs file could result in one process
seeing results from the other's namespace. Rather than one pidlist for
each file in a cgroup, we now keep a list of pidlists keyed by namespace
and file type (tasks versus procs) in which entries are placed on demand.
Each pidlist has its own lock, and that the pidlists themselves are passed
around in the seq_file's private pointer means we don't have to touch the
cgroup or its master list except when creating and destroying entries.
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct cgroup used to have a bunch of fields for keeping track of the
pidlist for the tasks file. Those are now separated into a new struct
cgroup_pidlist, of which two are had, one for procs and one for tasks.
The way the seq_file operations are set up is changed so that just the
pidlist struct gets passed around as the private data.
Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following series adds a "cgroup.procs" file to each cgroup that
reports unique tgids rather than pids, and allows all threads in a
threadgroup to be atomically moved to a new cgroup.
The subsystem "attach" interface is modified to support attaching whole
threadgroups at a time, which could introduce potential problems if any
subsystem were to need to access the old cgroup of every thread being
moved. The attach interface may need to be revised if this becomes the
case.
Also added is functionality for read/write locking all CLONE_THREAD
fork()ing within a threadgroup, by means of an rwsem that lives in the
sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
with the sighand's atomic count. This scheme should introduce no extra
overhead in the fork path when there's no contention.
The final patch reveals potential for a race when forking before a
subsystem's attach function is called - one potential solution in case any
subsystem has this problem is to hang on to the group's fork mutex through
the attach() calls, though no subsystem yet demonstrates need for an
extended critical section.
This patch:
Revert
commit 096b7fe012
Author: Li Zefan <lizf@cn.fujitsu.com>
AuthorDate: Wed Jul 29 15:04:04 2009 -0700
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Jul 29 19:10:35 2009 -0700
cgroups: fix pid namespace bug
This is in preparation for some clashing cgroups changes that subsume the
original commit's functionaliy.
The original commit fixed a pid namespace bug which Ben Blum fixed
independently (in the same way, but with different code) as part of a
series of patches. I played around with trying to reconcile Ben's patch
series with Li's patch, but concluded that it was simpler to just revert
Li's, given that Ben's patch series contained essentially the same fix.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the restriction that a cgroup hierarchy must have at
least one bound subsystem. The mount option "none" is treated as an
explicit request for no bound subsystems.
A hierarchy with no subsystems can be useful for plain task tracking, and
is also a step towards the support for multiply-bindable subsystems.
As part of this change, the hierarchy id is no longer calculated from the
bitmask of subsystems in the hierarchy (since this is not guaranteed to be
unique) but is allocated via an ida. Reference counts on cgroups from
css_set objects are now taken explicitly one per hierarchy, rather than
one per subsystem.
Example usage:
mount -t cgroup -o none,name=foo cgroup /mnt/cgroup
Based on the "no-op"/"none" subsystem concept proposed by
kamezawa.hiroyu@jp.fujitsu.com
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the cgroups code makes the assumption that the subsystem
pointers in a struct css_set uniquely identify the hierarchy->cgroup
mappings associated with the css_set; and there's no way to directly
identify the associated set of cgroups other than by indirecting through
the appropriate subsystem state pointers.
This patch removes the need for that assumption by adding a back-pointer
from struct cg_cgroup_link object to its associated cgroup; this allows
the set of cgroups to be determined by traversing the cg_links list in
the struct css_set.
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While it's architecturally clean to have the cgroup debug subsystem be
completely independent of the cgroups framework, it limits its usefulness
for debugging the contents of internal data structures. Move the debug
subsystem code into the scope of all the cgroups data structures to make
more detailed debugging possible.
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To simplify referring to cgroup hierarchies in mount statements, and to
allow disambiguation in the presence of empty hierarchies and
multiply-bindable subsystems this patch adds support for naming a new
cgroup hierarchy via the "name=" mount option
A pre-existing hierarchy may be specified by either name or by subsystems;
a hierarchy's name cannot be changed by a remount operation.
Example usage:
# To create a hierarchy called "foo" containing the "cpu" subsystem
mount -t cgroup -oname=foo,cpu cgroup /mnt/cgroup1
# To mount the "foo" hierarchy on a second location
mount -t cgroup -oname=foo cgroup /mnt/cgroup2
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the last unlock sequence consistent with previous unlock sequeue.
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many similar code in kernel for one object: convert time between
calendar time and broken-down time.
Here is some source I found:
fs/ncpfs/dir.c
fs/smbfs/proc.c
fs/fat/misc.c
fs/udf/udftime.c
fs/cifs/netmisc.c
net/netfilter/xt_time.c
drivers/scsi/ips.c
drivers/input/misc/hp_sdc_rtc.c
drivers/rtc/rtc-lib.c
arch/ia64/hp/sim/boot/fw-emu.c
arch/m68k/mac/misc.c
arch/powerpc/kernel/time.c
arch/parisc/include/asm/rtc.h
...
We can make a common function for this type of conversion, At least we
can get following benefit:
1: Make kernel simple and unify
2: Easy to fix bug in converting code
3: Reduce clone of code in future
For example, I'm trying to make ftrace display walltime,
this patch will make me easy.
This code is based on code from glibc-2.6
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup the useless dentry variable while creating a kernel
event set of files. trace_create_file() warns if it fails to
create the file anyway, and we don't store the dentry anywhere.
v2: Fix a small conflict in kernel/trace/trace_events.c
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cleanup remaining headers inclusion that were only useful when
the filter framework and its tracing related filesystem user interface
weren't yet separated.
v2: Keep module.h, needed for EXPORT_SYMBOL_GPL
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Audit will not respond to signal requests if selinux is disabled since it is
unable to translate the 0 sid from the sending process to a context. This
patch just doesn't send the context info if there isn't any.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pahole pointed out that on x86_64 struct audit_context can be rearrainged
to save 16 bytes per struct. Since we have an audit_context per task this
can acually be a pretty significant gain.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pahole showed that struct audit_watch had two holes:
struct audit_watch {
atomic_t count; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
char * path; /* 8 8 */
dev_t dev; /* 16 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int ino; /* 24 8 */
struct audit_parent * parent; /* 32 8 */
struct list_head wlist; /* 40 16 */
struct list_head rules; /* 56 16 */
/* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
/* size: 72, cachelines: 2, members: 7 */
/* sum members: 64, holes: 2, sum holes: 8 */
/* last cacheline: 8 bytes */
}; /* definitions: 1 */
by moving dev after count we save 8 bytes, actually improving cacheline
usage. There are typically very few of these in the kernel so it won't be
a large savings, but it's a good thing no matter what.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (39 commits)
cpumask: Move deprecated functions to end of header.
cpumask: remove unused deprecated functions, avoid accusations of insanity
cpumask: use new-style cpumask ops in mm/quicklist.
cpumask: use mm_cpumask() wrapper: x86
cpumask: use mm_cpumask() wrapper: um
cpumask: use mm_cpumask() wrapper: mips
cpumask: use mm_cpumask() wrapper: mn10300
cpumask: use mm_cpumask() wrapper: m32r
cpumask: use mm_cpumask() wrapper: arm
cpumask: Use accessors for cpu_*_mask: um
cpumask: Use accessors for cpu_*_mask: powerpc
cpumask: Use accessors for cpu_*_mask: mips
cpumask: Use accessors for cpu_*_mask: m32r
cpumask: remove arch_send_call_function_ipi
cpumask: arch_send_call_function_ipi_mask: s390
cpumask: arch_send_call_function_ipi_mask: powerpc
cpumask: arch_send_call_function_ipi_mask: mips
cpumask: arch_send_call_function_ipi_mask: m32r
cpumask: arch_send_call_function_ipi_mask: alpha
cpumask: remove obsolete topology_core_siblings and topology_thread_siblings: ia64
...
* remove asm/atomic.h inclusion from linux/utsname.h --
not needed after kref conversion
* remove linux/utsname.h inclusion from files which do not need it
NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however
due to some personality stuff it _is_ needed -- cowardly leave ELF-related
headers and files alone.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit c02e3f361c ("kmod: fix race in usermodehelper code")
The patch is wrong. UMH_WAIT_EXEC is called with VFORK what ensures
that the child finishes prior returing back to the parent. No race.
In fact, the patch makes it even worse because it does the thing it
claims not do:
- It calls ->complete() on UMH_WAIT_EXEC
- the complete() callback may de-allocated subinfo as seen in the
following call chain:
[<c009f904>] (__link_path_walk+0x20/0xeb4) from [<c00a094c>] (path_walk+0x48/0x94)
[<c00a094c>] (path_walk+0x48/0x94) from [<c00a0a34>] (do_path_lookup+0x24/0x4c)
[<c00a0a34>] (do_path_lookup+0x24/0x4c) from [<c00a158c>] (do_filp_open+0xa4/0x83c)
[<c00a158c>] (do_filp_open+0xa4/0x83c) from [<c009ba90>] (open_exec+0x24/0xe0)
[<c009ba90>] (open_exec+0x24/0xe0) from [<c009bfa8>] (do_execve+0x7c/0x2e4)
[<c009bfa8>] (do_execve+0x7c/0x2e4) from [<c0026a80>] (kernel_execve+0x34/0x80)
[<c0026a80>] (kernel_execve+0x34/0x80) from [<c004b514>] (____call_usermodehelper+0x130/0x148)
[<c004b514>] (____call_usermodehelper+0x130/0x148) from [<c0024858>] (kernel_thread_exit+0x0/0x8)
and the path pointer was NULL. Good that ARM's kernel_execve()
doesn't check the pointer for NULL or else I wouldn't notice it.
The only race there might be is with UMH_NO_WAIT but it is too late for
me to investigate it now. UMH_WAIT_PROC could probably also use VFORK
and we could save one exec. So the only race I see is with UMH_NO_WAIT
and recent scheduler changes where the child does not always run first
might have trigger here something but as I said, it is late....
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove open-coded zalloc_cpumask_var() and zalloc_cpumask_var_node().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
SELinux: do not destroy the avc_cache_nodep
KEYS: Have the garbage collector set its timer for live expired keys
tpm-fixup-pcrs-sysfs-file-update
creds_are_invalid() needs to be exported for use by modules:
include/linux/cred.h: fix build
Fix trivial BUILD_BUG_ON-induced conflicts in drivers/char/tpm/tpm.c
mips allmodconfig:
include/linux/cred.h: In function `creds_are_invalid':
include/linux/cred.h:187: error: `PAGE_SIZE' undeclared (first use in this function)
include/linux/cred.h:187: error: (Each undeclared identifier is reported only once
include/linux/cred.h:187: error: for each function it appears in.)
Fixes
commit b6dff3ec5e
Author: David Howells <dhowells@redhat.com>
AuthorDate: Fri Nov 14 10:39:16 2008 +1100
Commit: James Morris <jmorris@namei.org>
CommitDate: Fri Nov 14 10:39:16 2008 +1100
CRED: Separate task security context from task_struct
I think.
It's way too large to be inlined anyway.
Dunno if this needs an EXPORT_SYMBOL() yet.
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
These issues identified during an old-fashioned face-to-face code
review extending over many hours.
o Add comments for tricky parts of code, and correct comments
that have passed their sell-by date.
o Get rid of the vestiges of rcu_init_sched(), which is no
longer needed now that PREEMPT_RCU is gone.
o Move the #include of rcutree_plugin.h to the end of
rcutree.c, which means that, rather than having a random
collection of forward declarations, the new set of forward
declarations document the set of plugins. The new home for
this #include also allows __rcu_init_preempt() to move into
rcutree_plugin.h.
o Fix rcu_preempt_check_callbacks() to be static.
Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12537246443924-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra <peterz@infradead.org>
These issues identified during an old-fashioned face-to-face code
review extended over many hours.
o Bury various forms of the "rsp->completed == rsp->gpnum"
comparison into an rcu_gp_in_progress() function, which has
the beneficial side-effect of forcing consistent use of
ACCESS_ONCE().
o Replace hand-coded arithmetic with DIV_ROUND_UP().
o Bury several "!list_empty(&rnp->blocked_tasks[rnp->gpnum & 0x01])"
instances into an rcu_preempted_readers() function, as this
expression indicates that there are no readers blocked
within RCU read-side critical sections blocking the current
grace period. (Though there might well be similar readers
blocking the next grace period.)
o Remove a dangling rcu_restart_cpu() declaration that has
been dangling for almost 20 minor releases of the kernel.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12537246442687-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Originally, walk_memory_resource() was introduced to traverse all memory
of "System RAM" for detecting memory hotplug/unplug range. For doing so,
flags of IORESOUCE_MEM|IORESOURCE_BUSY was used and this was enough for
memory hotplug.
But for using other purpose, /proc/kcore, this may includes some firmware
area marked as IORESOURCE_BUSY | IORESOUCE_MEM. This patch makes the
check strict to find out busy "System RAM".
Note: PPC64 keeps their own walk_memory_resouce(), which walk through
ppc64's lmb informaton. Because old kclist_add() is called per lmb, this
patch makes no difference in behavior, finally.
And this patch removes CONFIG_MEMORY_HOTPLUG check from this function.
Because pfn_valid() just show "there is memmap or not* and cannot be used
for "there is physical memory or not", this function is useful in generic
to scan physical memory range.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A patch to give a better overview of the userland application stack usage,
especially for embedded linux.
Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value. But you get no
information about the consumed stack memory of the the threads.
There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.
A sample output of /proc/<pid>/task/<tid>/maps looks like:
08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
which will you give the current stack usage in kb.
A sample output of /proc/self/status looks like:
Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB
I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process. This makes more sense.
[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows lockdep to locate symbols that are in arch-specific data
sections (such as data in Blackfin on-chip SRAM regions).
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows kallsyms to locate symbols that are in arch-specific text
sections (such as text in Blackfin on-chip SRAM regions).
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The act of a process becoming a session leader is a useful signal to a
supervising init daemon such as Upstart.
While a daemon will normally do this as part of the process of becoming a
daemon, it is rare for its children to do so. When the children do, it is
nearly always a sign that the child should be considered detached from the
parent and not supervised along with it.
The poster-child example is OpenSSH; the per-login children call setsid()
so that they may control the pty connected to them. If the primary daemon
dies or is restarted, we do not want to consider the per-login children
and want to respawn the primary daemon without killing the children.
This patch adds a new PROC_SID_EVENT and associated structure to the
proc_event event_data union, it arranges for this to be emitted when the
special PIDTYPE_SID pid is set.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Scott James Remnant <scott@ubuntu.com>
Acked-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make all seq_operations structs const, to help mitigate against
revectoring user-triggerable function pointers.
This is derived from the grsecurity patch, although generated from scratch
because it's simpler than extracting the changes from there.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch can remove spinlock from struct call_function_data, the
reasons are below:
1: add a new interface for cpumask named cpumask_test_and_clear_cpu(),
it can atomically test and clear specific cpu, we can use it instead
of cpumask_test_cpu() and cpumask_clear_cpu() and no need data->lock
to protect those in generic_smp_call_function_interrupt().
2: in smp_call_function_many(), after csd_lock() return, the current's
cfd_data is deleted from call_function list, so it not have race
between other cpus, then cfs_data is only used in
smp_call_function_many() that must disable preemption and not from
a hardware interrupthandler or from a bottom half handler to call,
only the correspond cpu can use it, so it not have race in current
cpu, no need cfs_data->lock to protect it.
3: after 1 and 2, cfs_data->lock is only use to protect cfs_data->refs in
generic_smp_call_function_interrupt(), so we can define cfs_data->refs
to atomic_t, and no need cfs_data->lock any more.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
[akpm@linux-foundation.org: use atomic_dec_return()]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The user mode helper code has a race in it. call_usermodehelper_exec()
takes an allocated subprocess_info structure, which it passes to a
workqueue, and then passes it to a kernel thread which it creates, after
which it calls complete to signal to the caller of
call_usermodehelper_exec() that it can free the subprocess_info struct.
But since we use that structure in the created thread, we can't call
complete from __call_usermodehelper(), which is where we create the kernel
thread. We need to call complete() from within the kernel thread and then
not use subprocess_info afterward in the case of UMH_WAIT_EXEC. Tested
successfully by me.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When syslog is not possible, at the same time there's no serial/net
console available, it will be hard to read the printk messages. For
example oops/panic/warning messages in shutdown phase.
Add a printk delay feature, we can make each printk message delay some
milliseconds.
Setting the delay by proc/sysctl interface: /proc/sys/kernel/printk_delay
The value range from 0 - 10000, default value is 0
[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename `printk_delay_msec' to `loops_per_msec', because the patch "printk:
add printk_delay to make messages readable for some scenarios" wishes to
more appropriately use the `printk_delay_msec' identifier.
[akpm@linux-foundation.org: add a comment]
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus reported this new build warning:
kernel/module.c:2951: warning: ?struct marker? declared inside parameter list
kernel/module.c:2951: warning: its scope is only this definition or declaration, which is probably not what you want
Caused by:
fc53776: tracing: Remove markers
module_layout() is an artificial symbol with 'significant' symbols
listed in its argument list so that it gets a proper argument types
signature that modversions can pick up to decide whether a
module is version-compatible or not. If these dont match then we
wont even look at a module.
Remove the stale marker symbol.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <alpine.LFD.2.01.0909210908020.4950@localhost.localdomain>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
trivial: fix typo in aic7xxx comment
trivial: fix comment typo in drivers/ata/pata_hpt37x.c
trivial: typo in kernel-parameters.txt
trivial: fix typo in tracing documentation
trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
trivial: remove unnecessary semicolons
trivial: Fix duplicated word "options" in comment
trivial: kbuild: remove extraneous blank line after declaration of usage()
trivial: improve help text for mm debug config options
trivial: doc: hpfall: accept disk device to unload as argument
trivial: doc: hpfall: reduce risk that hpfall can do harm
trivial: SubmittingPatches: Fix reference to renumbered step
trivial: fix typos "man[ae]g?ment" -> "management"
trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
trivial: fix missing printk space in amd_k7_smp_check
trivial: fix typo s/ketymap/keymap/ in comment
trivial: fix typo "to to" in multiple files
trivial: fix typos in comments s/DGBU/DBGU/
...
Decouple kernel.h from ratelimit.h: the global declaration of
printk's ratelimit_state is not needed, and it leads to messy
circular dependencies due to ratelimit.h's (new) adding of a
spinlock_types.h include.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the menu idle governor which balances power savings, energy efficiency
and performance impact.
The reason for a reworked governor is that there have been serious
performance issues reported with the existing code on Nehalem server
systems.
To show this I'm sure Andrew wants to see benchmark results:
(benchmark is "fio", "no cstates" is using "idle=poll")
no cstates current linux new algorithm
1 disk 107 Mb/s 85 Mb/s 105 Mb/s
2 disks 215 Mb/s 123 Mb/s 209 Mb/s
12 disks 590 Mb/s 320 Mb/s 585 Mb/s
In various power benchmark measurements, no degredation was found by our
measurement&diagnostics team. Obviously a small percentage more power was
used in the "fio" benchmark, due to the much higher performance.
While it would be a novel idea to describe the new algorithm in this
commit message, I cheaped out and described it in comments in the code
instead.
[changes since first post: spelling fixes from akpm, review feedback,
folded menu-tng into menu.c]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some architectures (like the Blackfin arch) implement some of the
"simpler" features that one would expect out of a MMU such as memory
protection.
In our case, we actually get read/write/exec protection down to the page
boundary so processes can't stomp on each other let alone the kernel.
There is a performance decrease (which depends greatly on the workload)
however as the hardware/software interaction was not optimized at design
time.
Signed-off-by: Bernd Schmidt <bernds_cb1@t-online.de>
Signed-off-by: Bryan Wu <cooloney@kernel.org>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, OOM logic callflow is here.
__out_of_memory()
select_bad_process() for each task
badness() calculate badness of one task
oom_kill_process() search child
oom_kill_task() kill target task and mm shared tasks with it
example, process-A have two thread, thread-A and thread-B and it have very
fat memory and each thread have following oom_adj and oom_score.
thread-A: oom_adj = OOM_DISABLE, oom_score = 0
thread-B: oom_adj = 0, oom_score = very-high
Then, select_bad_process() select thread-B, but oom_kill_task() refuse
kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory()
call select_bad_process() again. but select_bad_process() select the same
task. It mean kernel fall in livelock.
The fact is, select_bad_process() must select killable task. otherwise
OOM logic go into livelock.
And root cause is, oom_adj shouldn't be per-thread value. it should be
per-process value because OOM-killer kill a process, not thread. Thus
This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct signal_struct. it naturally prevent
select_bad_process() choose wrong task.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is being done by allowing boot time allocations to specify that they
may want a sub-page sized amount of memory.
Overall this seems more consistent with the other hash table allocations,
and allows making two supposedly mm-only variables really mm-only
(nr_{kernel,all}_pages).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since alloc_bootmem() will never return inaccessible (via virtual
addressing) memory anyway, using the ..._low() variant only makes sense
when the physical address range of the allocated memory must fulfill
further constraints, espacially since on 64-bits (or more generally in all
cases where the pools the two variants allocate from are than the full
available range.
Probably the use in alloc_tce_table() could also be eliminated (based on
code inspection of pci-calgary_64.c), but that seems too risky given I
know nothing about that hardware and have no way to test it.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rawhide users have reported hang at startup when cryptsetup is run: the
same problem can be simply reproduced by running a program int main() {
mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
The problem is that exit_mmap() applies munlock_vma_pages_all() to
clean up VM_LOCKED areas, and its current implementation (stupidly)
tries to fault in absent pages, for example where PROT_NONE prevented
them being faulted in when mlocking. Whereas the "ksm: fix oom
deadlock" patch, knowing there's a race by which KSM might try to fault
in pages after exit_mmap() had finally zapped the range, backs out of
such faults doing nothing when its ksm_test_exit() notices mm_users 0.
So revert that part of "ksm: fix oom deadlock" which moved the
ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
and remove those ksm_test_exit() checks from the page fault paths, so
allowing the munlocking to proceed without interference.
ksm_exit, if there are rmap_items still chained on this mm slot, takes
mmap_sem write side: so preventing KSM from working on an mm while
exit_mmap runs. And KSM will bail out as soon as it notices that
mm_users is already zero, thanks to its internal ksm_test_exit checks.
So that when a task is killed by OOM killer or the user, KSM will not
indefinitely prevent it from running exit_mmap to release its memory.
This does break a part of what "ksm: fix oom deadlock" was trying to
achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
when ksmd itself has to cancel a KSM page, it is possible that the
first OOM-kill victim would be the KSM process being faulted: then its
memory won't be freed until a second victim has been selected (freeing
memory for the unmerging fault to complete).
But the OOM killer is already liable to kill a second victim once the
intended victim's p->mm goes to NULL: so there's not much point in
rejecting this KSM patch before fixing that OOM behaviour. It is very
much more important to allow KSM users to boot up, than to haggle over
an unlikely and poorly supported OOM case.
We also intend to fix munlocking to not fault pages: at which point
this patch _could_ be reverted; though that would be controversial, so
we hope to find a better solution.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Justin M. Forbes <jforbes@redhat.com>
Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a now-obvious deadlock in KSM's out-of-memory handling:
imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
trying to allocate a page to break KSM in an mm which becomes the
OOM victim (quite likely in the unmerge case): it's killed and goes
to exit, and hangs there waiting to acquire ksm_thread_mutex.
Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
though that made everything else: perhaps use mmap_sem somehow?
And part of the answer lies in the comments on unmerge_ksm_pages:
__ksm_exit should also leave all the rmap_item removal to ksmd.
But there's a fundamental problem, that KSM relies upon mmap_sem to
guarantee the consistency of the mm it's dealing with, yet exit_mmap
tears down an mm without taking mmap_sem. And bumping mm_users won't
help at all, that just ensures that the pages the OOM killer assumes
are on their way to being freed will not be freed.
The best answer seems to be, to move the ksm_exit callout from just
before exit_mmap, to the middle of exit_mmap: after the mm's pages
have been freed (if the mmu_gather is flushed), but before its page
tables and vma structures have been freed; and down_write,up_write
mmap_sem there to serialize with KSM's own reliance on mmap_sem.
But KSM then needs to be careful, whenever it downs mmap_sem, to
check that the mm is not already exiting: there's a danger of using
find_vma on a layout that's being torn apart, or writing into page
tables which have been freed for reuse; and even do_anonymous_page
and __do_fault need to check they're not being called by break_ksm
to reinstate a pte after zap_pte_range has zapped that page table.
Though it might be clearer to add an exiting flag, set while holding
mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
a zapped pte. All we need is to check whether mm_users is 0 - but
must remember that ksmd may detect that before __ksm_exit is reached.
So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
__ksm_exit now has to leave clearing up the rmap_items to ksmd,
that needs ksm_thread_mutex; but shift the exiting mm just after the
ksm_scan cursor so that it will soon be dealt with. __ksm_enter raise
mm_count to hold the mm_struct, ksmd's exit processing (exactly like
its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).
But also give __ksm_exit a fast path: when there's no complication
(no rmap_items attached to mm and it's not at the ksm_scan cursor),
it can safely do all the exiting work itself. This is not just an
optimization: when ksmd is not running, the raised mm_count would
otherwise leak mm_structs.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch presents the mm interface to a dummy version of ksm.c, for
better scrutiny of that interface: the real ksm.c follows later.
When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and
MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
pretending that they can be serviced. But when CONFIG_KSM=y, accept them
even if KSM is not currently running, and even on areas which KSM will not
touch (e.g. hugetlb or shared file or special driver mappings).
Like other madvices, report ENOMEM despite success if any area in the
range is unmapped, and use EAGAIN to report out of memory.
Define vma flag VM_MERGEABLE to identify an area on which KSM may try
merging pages: leave it to ksm_madvise() to decide whether to set it.
Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
VM_MERGEABLE areas, to minimize callouts when forking or exiting.
Based upon earlier patches by Chris Wright and Izik Eidus.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The amount of memory allocated to kernel stacks can become significant and
cause OOM conditions. However, we do not display the amount of memory
consumed by stacks.
Add code to display the amount of memory used for stacks in /proc/meminfo.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PI futexes do not use the same plist_node_empty() test for wakeup.
It was possible for the waiter (in futex_wait_requeue_pi()) to set
TASK_INTERRUPTIBLE after the waker assigned the rtmutex to the
waiter. The waiter would then note the plist was not empty and call
schedule(). The task would not be found by any subsequeuent futex
wakeups, resulting in a userspace hang.
By moving the setting of TASK_INTERRUPTIBLE to before the call to
queue_me(), the race with the waker is eliminated. Since we no
longer call get_user() from within queue_me(), there is no need to
delay the setting of TASK_INTERRUPTIBLE until after the call to
queue_me().
The FUTEX_LOCK_PI operation is not affected as futex_lock_pi()
relies entirely on the rtmutex code to handle schedule() and
wakeup. The requeue PI code is affected because the waiter starts
as a non-PI waiter and is woken on a PI futex.
Remove the crusty old comment about holding spinlocks() across
get_user() as we no longer do that. Correct the locking statement
with a description of why the test is performed.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922053038.8717.97838.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use kernel-doc format to describe struct futex_q.
Correct the wakeup definition to eliminate the statement about
waking the waiter between the plist_del() and the q->lock_ptr = 0.
Note in the comment that PI futexes have a different definition of
the woken state.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922053029.8717.62798.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the existing function kernel-doc consistent throughout
futex.c, following Documentation/kernel-doc-nano-howto.txt as
closely as possible.
When unsure, at least be consistent within futex.c.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922053022.8717.13339.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The queue_me/unqueue_me commentary is oddly placed and out of date.
Clean it up and correct the inaccurate bits.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922053015.8717.71713.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Correct various typos and formatting inconsistencies in the
commentary of futex_wait_requeue_pi().
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922052958.8717.21932.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Leave the last slot for the tailing '\0'.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4AB865FA.5080801@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sparc64 allnoconfig:
arch/sparc/kernel/built-in.o(.text+0x134e0): In function `sys32_recvfrom':
: undefined reference to `compat_sys_recvfrom'
arch/sparc/kernel/built-in.o(.text+0x134e4): In function `sys32_recvfrom':
: undefined reference to `compat_sys_recvfrom'
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Tidy up after the big rename
perf: Do the big rename: Performance Counters -> Performance Events
perf_counter: Rename 'event' to event_id/hw_event
perf_counter: Rename list_entry -> group_entry, counter_list -> group_list
Manually resolved some fairly trivial conflicts with the tracing tree in
include/trace/ftrace.h and kernel/trace/trace_syscalls.c.
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: Fix whitespace inconsistencies
rcu: Fix thinko, actually initialize full tree
rcu: Apply results of code inspection of kernel/rcutree_plugin.h
rcu: Add WARN_ON_ONCE() consistency checks covering state transitions
rcu: Fix synchronize_rcu() for TREE_PREEMPT_RCU
rcu: Simplify rcu_read_unlock_special() quiescent-state accounting
rcu: Add debug checks to TREE_PREEMPT_RCU for premature grace periods
rcu: Kconfig help needs to say that TREE_PREEMPT_RCU scales down
rcutorture: Occasionally delay readers enough to make RCU force_quiescent_state
rcu: Initialize multi-level RCU grace periods holding locks
rcu: Need to update rnp->gpnum if preemptable RCU is to be reliable
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Simplify sys_sched_rr_get_interval() system call
sched: Fix potential NULL derference of doms_cur
sched: Fix raciness in runqueue_is_locked()
sched: Re-add lost cpu_allowed check to sched_fair.c::select_task_rq_fair()
sched: Remove unneeded indentation in sched_fair.c::place_entity()
this was introduced in
5e0a093 (tracing: fix config options to not show when automatically selected)
Signed-off-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: trivial@kernel.org
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
- provide compatibility Kconfig entry for existing PERF_COUNTERS .config's
- provide courtesy copy of old perf_counter.h, for user-space projects
- small indentation fixups
- fix up MAINTAINERS
- fix small x86 printout fallout
- fix up small PowerPC comment fallout (use 'counter' as in register)
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In preparation to the renames, to avoid a namespace clash.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is in preparation of the big rename, but also makes sense
in a standalone way: 'list_entry' is a bad name as we already
have a list_entry() in list.h.
Also, the 'counter list' is too vague, it doesnt tell us the
purpose of that list.
Clarify these names to show that it's all about the group
hiearchy.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Cpus_allowed fields in /proc/<pid>/status is currently only
shown in case of CONFIG_CPUSETS. However their contents are also
useful for the !CONFIG_CPUSETS case.
So change the current behaviour and always show these fields.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090921090627.GD4649@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
By removing the need for it to know details of scheduling classes.
This allows PlugSched to define orthogonal scheduling classes.
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <06d1b89ee15a0eef82d7.1253496713@mudlark.pw.nest>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (79 commits)
USB serial: update the console driver
usb-serial: straighten out serial_open
usb-serial: add missing tests and debug lines
usb-serial: rename subroutines
usb-serial: fix termios initialization logic
usb-serial: acquire references when a new tty is installed
usb-serial: change logic of serial lookups
usb-serial: put subroutines in logical order
usb-serial: change referencing of port and serial structures
tty: Char: mxser, use THRE for ASPP_OQUEUE ioctl
tty: Char: mxser, add support for CP112UL
uartlite: support shared interrupt lines
tty: USB: serial/mct_u232, fix tty refcnt
tty: riscom8, fix tty refcnt
tty: riscom8, fix shutdown declaration
TTY: fix typos
tty: Power: fix suspend vt regression
tty: vt: use printk_once
tty: handle VT specific compat ioctls in vt driver
n_tty: move echoctl check and clean up logic
...
* 'perfcounters-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (58 commits)
perf_counter: Fix perf_copy_attr() pointer arithmetic
perf utils: Use a define for the maximum length of a trace event
perf: Add timechart help text and add timechart to "perf help"
tracing, x86, cpuidle: Move the end point of a C state in the power tracer
perf utils: Be consistent about minimum text size in the svghelper
perf timechart: Add "perf timechart record"
perf: Add the timechart tool
perf: Add a SVG helper library file
tracing, perf: Convert the power tracer into an event tracer
perf: Add a sample_event type to the event_union
perf: Allow perf utilities to have "callback" options without arguments
perf: Store trace event name/id pairs in perf.data
perf: Add a timestamp to fork events
sched_clock: Make it NMI safe
perf_counter: Fix up swcounter throttling
x86, perf_counter, bts: Optimize BTS overflow handling
perf sched: Add --input=file option to builtin-sched.c
perf trace: Sample timestamp and cpu when using record flag
perf tools: Increase MAX_EVENT_LENGTH
perf tools: Fix memory leak in read_ftrace_printk()
...
If CONFIG_CPUMASK_OFFSTACK is enabled but doms_cur alloc failed in
arch_init_sched_domains(), doms_cur will move back to
fallback_doms. But this time, fallback_doms has not been
initialized yet.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: a.p.zijlstra@chello.nl
LKML-Reference: <1252930816-7672-1-git-send-email-yong.zhang0@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
runqueue_is_locked() is unavoidably racy due to a poor interface design.
It does
cpu = get_cpu()
ret = some_perpcu_thing(cpu);
put_cpu(cpu);
return ret;
Its return value is unreliable.
Fix.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <200909191855.n8JItiko022148@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>