Complete the syscall-less self-profiling feature and address
all complaints, namely:
- capabilities, so we can detect what is actually available at runtime
Add a capabilities field to perf_event_mmap_page to indicate
what is actually available for use.
- on x86: RDPMC weirdness due to being 40/48 bits and not sign-extending
properly.
- ABI documentation as to how all this stuff works.
Also improve the documentation for the new features.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1332433596.2487.33.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
. Short term fix for 'diff' tool breakage related to perf.data files
with multiple events. From Jiri Olsa
. Cleanup for event id tracepoint reading routine, from Borislav Petkov
. 32-bit compilation fixes from Jiri Olsa
. Event parsing modifier assignment fixes from Jiri Olsa
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJPa24hAAoJENZQFvNTUqpARrEQAIXfsXG2X45twEFHJ0oOmvZI
7F28pMFM0OBJdSq/MKyO86+j5wtbtKtjCcVZfVm1ukkeD7KnKQv+iyBw1rEuE/dx
IJIIS+zkCEAje7s64diNax/rpvZXp+k92pKKo/JGjalsVcw63VF5n0Ej3JA+n8Qp
anKyJd2SJFogoljTzRnFnZ+zsbn5CWVwb3NVRhH8t+jfKKOsmw8Nq5ds5ysuRf8x
+Xpm3YAsyKzAJMd05Uo5no+Ne4A5TukySiarZl3wiQ0TrwM9mVxQVAU5iVkAUZkb
Bp1dgBOgG1RNudcRSgbq3NuO7TGFby6DqODgXvo1TzpLyixKbWUz3JI0mEya5xCL
eBxJ2UTQpruBg+XJ3C0ys74TkMcQUeyxhqbwPu1WhMbP7cIJK1W1hJ728qWRG94a
uxFMFlEfc0d9kxlsXVvxUumiyhEw5gPtwdbQf7fo2xpMPQmHCC3yS7DdMO6FBfUW
oqPC7gb3aSTmHnKuYvwbT9EcOL27YHU+Y/4sRJ/QfohArgc4h60dUQrE9fWz7dXH
/eCXDvMnBs2+z52dNLq/mkoQMP4iFzWqboDNn5GM4jI2LmXd/hxBGL6lb2Vzm8CI
QYi5xC4E6VStSkJG2DTtvaYTclWOjo+ZrsGz7sDXQV3MgZ1wiZIWcGiPjeFBWU25
mHWkZT3of/MAbewmeklc
=V7d9
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Cleanups and fixes for perf/core:
. Short term fix for 'diff' tool breakage related to perf.data files
with multiple events. From Jiri Olsa
. Cleanup for event id tracepoint reading routine, from Borislav Petkov
. 32-bit compilation fixes from Jiri Olsa
. Event parsing modifier assignment fixes from Jiri Olsa
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timer changes for v3.4 from Ingo Molnar
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
ntp: Fix integer overflow when setting time
math: Introduce div64_long
cs5535-clockevt: Allow the MFGPT IRQ to be shared
cs5535-clockevt: Don't ignore MFGPT on SMP-capable kernels
x86/time: Eliminate unused irq0_irqs counter
clocksource: scx200_hrt: Fix the build
x86/tsc: Reduce the TSC sync check time for core-siblings
timer: Fix bad idle check on irq entry
nohz: Remove ts->Einidle checks before restarting the tick
nohz: Remove update_ts_time_stat from tick_nohz_start_idle
clockevents: Leave the broadcast device in shutdown mode when not needed
clocksource: Load the ACPI PM clocksource asynchronously
clocksource: scx200_hrt: Convert scx200 to use clocksource_register_hz
clocksource: Get rid of clocksource_calc_mult_shift()
clocksource: dbx500: convert to clocksource_register_hz()
clocksource: scx200_hrt: use pr_<level> instead of printk
time: Move common updates to a function
time: Reorder so the hot data is together
time: Remove most of xtime_lock usage in timekeeping.c
ntp: Add ntp_lock to replace xtime_locking
...
Pull scheduler changes for v3.4 from Ingo Molnar
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
printk: Make it compile with !CONFIG_PRINTK
sched/x86: Fix overflow in cyc2ns_offset
sched: Fix nohz load accounting -- again!
sched: Update yield() docs
printk/sched: Introduce special printk_sched() for those awkward moments
sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
sched: Cleanup cpu_active madness
sched: Fix load-balance wreckage
sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
sched: Ditch per cgroup task lists for load-balancing
sched: Rename load-balancing fields
sched: Move load-balancing arguments into helper struct
sched/rt: Do not submit new work when PI-blocked
sched/rt: Prevent idle task boosting
sched/wait: Add __wake_up_all_locked() API
sched/rt: Document scheduler related skip-resched-check sites
sched/rt: Use schedule_preempt_disabled()
sched/rt: Add schedule_preempt_disabled()
sched/rt: Do not throttle when PI boosting
sched/rt: Keep period timer ticking when rt throttling is active
...
Pull perf events changes for v3.4 from Ingo Molnar:
- New "hardware based branch profiling" feature both on the kernel and
the tooling side, on CPUs that support it. (modern x86 Intel CPUs
with the 'LBR' hardware feature currently.)
This new feature is basically a sophisticated 'magnifying glass' for
branch execution - something that is pretty difficult to extract from
regular, function histogram centric profiles.
The simplest mode is activated via 'perf record -b', and the result
looks like this in perf report:
$ perf record -b any_call,u -e cycles:u branchy
$ perf report -b --sort=symbol
52.34% [.] main [.] f1
24.04% [.] f1 [.] f3
23.60% [.] f1 [.] f2
0.01% [k] _IO_new_file_xsputn [k] _IO_file_overflow
0.01% [k] _IO_vfprintf_internal [k] _IO_new_file_xsputn
0.01% [k] _IO_vfprintf_internal [k] strchrnul
0.01% [k] __printf [k] _IO_vfprintf_internal
0.01% [k] main [k] __printf
This output shows from/to branch columns and shows the highest
percentage (from,to) jump combinations - i.e. the most likely taken
branches in the system. "branches" can also include function calls
and any other synchronous and asynchronous transitions of the
instruction pointer that are not 'next instruction' - such as system
calls, traps, interrupts, etc.
This feature comes with (hopefully intuitive) flat ascii and TUI
support in perf report.
- Various 'perf annotate' visual improvements for us assembly junkies.
It will now recognize function calls in the TUI and by hitting enter
you can follow the call (recursively) and back, amongst other
improvements.
- Multiple threads/processes recording support in perf record, perf
stat, perf top - which is activated via a comma-list of PIDs:
perf top -p 21483,21485
perf stat -p 21483,21485 -ddd
perf record -p 21483,21485
- Support for per UID views, via the --uid paramter to perf top, perf
report, etc. For example 'perf top --uid mingo' will only show the
tasks that I am running, excluding other users, root, etc.
- Jump label restructurings and improvements - this includes the
factoring out of the (hopefully much clearer) include/linux/static_key.h
generic facility:
struct static_key key = STATIC_KEY_INIT_FALSE;
...
if (static_key_false(&key))
do unlikely code
else
do likely code
...
static_key_slow_inc();
...
static_key_slow_inc();
...
The static_key_false() branch will be generated into the code with as
little impact to the likely code path as possible. the
static_key_slow_*() APIs flip the branch via live kernel code patching.
This facility can now be used more widely within the kernel to
micro-optimize hot branches whose likelihood matches the static-key
usage and fast/slow cost patterns.
- SW function tracer improvements: perf support and filtering support.
- Various hardenings of the perf.data ABI, to make older perf.data's
smoother on newer tool versions, to make new features integrate more
smoothly, to support cross-endian recording/analyzing workflows
better, etc.
- Restructuring of the kprobes code, the splitting out of 'optprobes',
and a corner case bugfix.
- Allow the tracing of kernel console output (printk).
- Improvements/fixes to user-space RDPMC support, allowing user-space
self-profiling code to extract PMU counts without performing any
system calls, while playing nice with the kernel side.
- 'perf bench' improvements
- ... and lots of internal restructurings, cleanups and fixes that made
these features possible. And, as usual this list is incomplete as
there were also lots of other improvements
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
perf report: Fix annotate double quit issue in branch view mode
perf report: Remove duplicate annotate choice in branch view mode
perf/x86: Prettify pmu config literals
perf report: Enable TUI in branch view mode
perf report: Auto-detect branch stack sampling mode
perf record: Add HEADER_BRANCH_STACK tag
perf record: Provide default branch stack sampling mode option
perf tools: Make perf able to read files from older ABIs
perf tools: Fix ABI compatibility bug in print_event_desc()
perf tools: Enable reading of perf.data files from different ABI rev
perf: Add ABI reference sizes
perf report: Add support for taken branch sampling
perf record: Add support for sampling taken branch
perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
x86/kprobes: Split out optprobe related code to kprobes-opt.c
x86/kprobes: Fix a bug which can modify kernel code permanently
x86/kprobes: Fix instruction recovery on optimized path
perf: Add callback to flush branch_stack on context switch
perf: Disable PERF_SAMPLE_BRANCH_* when not supported
perf/x86: Add LBR software filter support for Intel CPUs
...
Pull irq/core changes for v3.4 from Ingo Molnar
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Remove paranoid warnons and bogus fixups
genirq: Flush the irq thread on synchronization
genirq: Get rid of unnecessary IRQTF_DIED flag
genirq: No need to check IRQTF_DIED before stopping a thread handler
genirq: Get rid of unnecessary irqaction field in task_struct
genirq: Fix incorrect check for forced IRQ thread handler
softirq: Reduce invoke_softirq() code duplication
genirq: Fix long-term regression in genirq irq_set_irq_type() handling
x86-32/irq: Don't switch to irq stack for a user-mode irq
Pull RCU changes for v3.4 from Ingo Molnar. The major features of this
series are:
- making RCU more aggressive about entering dyntick-idle mode in order
to improve energy efficiency
- converting a few more call_rcu()s to kfree_rcu()s
- applying a number of rcutree fixes and cleanups to rcutiny
- removing CONFIG_SMP #ifdefs from treercu
- allowing RCU CPU stall times to be set via sysfs
- adding CPU-stall capability to rcutorture
- adding more RCU-abuse diagnostics
- updating documentation
- fixing yet more issues located by the still-ongoing top-to-bottom
inspection of RCU, this time with a special focus on the CPU-hotplug
code path.
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rcu: Stop spurious warnings from synchronize_sched_expedited
rcu: Hold off RCU_FAST_NO_HZ after timer posted
rcu: Eliminate softirq-mediated RCU_FAST_NO_HZ idle-entry loop
rcu: Add RCU_NONIDLE() for idle-loop RCU read-side critical sections
rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit()
rcu: Remove redundant check for rcu_head misalignment
PTR_ERR should be called before its argument is cleared.
rcu: Convert WARN_ON_ONCE() in rcu_lock_acquire() to lockdep
rcu: Trace only after NULL-pointer check
rcu: Call out dangers of expedited RCU primitives
rcu: Rework detection of use of RCU by offline CPUs
lockdep: Add CPU-idle/offline warning to lockdep-RCU splat
rcu: No interrupt disabling for rcu_prepare_for_idle()
rcu: Move synchronize_sched_expedited() to rcutree.c
rcu: Check for illegal use of RCU from offlined CPUs
rcu: Update stall-warning documentation
rcu: Add CPU-stall capability to rcutorture
rcu: Make documentation give more realistic rcutorture duration
rcutorture: Permit holding off CPU-hotplug operations during boot
rcu: Print scheduling-clock information on RCU CPU stall-warning messages
...
The tracing_on/off() declarations were under CONFIG_RING_BUFFER, but
the functions are now only defined under CONFIG_TRACING as they are
specific to ftrace and not the ring buffer.
But the declarations were still defined under the ring buffer and
this caused the build to fail when CONFIG_RING_BUFFER was set but
CONFIG_TRACING was not.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding sysfs group 'format' attribute for pmu device that
contains a syntax description on how to construct raw events.
The event configuration is described in following
struct pefr_event_attr attributes:
config
config1
config2
Each sysfs attribute within the format attribute group,
describes mapping of name and bitfield definition within
one of above attributes.
eg:
"/sys/...<dev>/format/event" contains "config:0-7"
"/sys/...<dev>/format/umask" contains "config:8-15"
"/sys/...<dev>/format/usr" contains "config:16"
the attribute value syntax is:
line: config ':' bits
config: 'config' | 'config1' | 'config2"
bits: bits ',' bit_term | bit_term
bit_term: VALUE '-' VALUE | VALUE
Adding format attribute definitions for x86 cpu pmus.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/n/tip-vhdk5y2hyype9j63prymty36@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add a div64_long macro which is used to devide a 64bit number by a long (which
can be 4 bytes on 32bit systems and 8 bytes on 64bit systems).
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: johnstul@us.ibm.com
Link: http://lkml.kernel.org/r/1331829374-31543-1-git-send-email-levinsasha928@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull block fixes from Jens Axboe:
"Been sitting on this for a while, but lets get this out the door.
This fixes various important bugs for 3.3 final, along with a few more
trivial ones. Please pull!"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix ioc leak in put_io_context
block, sx8: fix pointer math issue getting fw version
Block: use a freezable workqueue for disk-event polling
drivers/block/DAC960: fix -Wuninitialized warning
drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
block: fix __blkdev_get and add_disk race condition
block: Fix setting bio flags in drivers (sd_dif/floppy)
block: Fix NULL pointer dereference in sd_revalidate_disk
block: exit_io_context() should call elevator_exit_icq_fn()
block: simplify ioc_release_fn()
block: replace icq->changed with icq->flags
Fix for unused symbols in switch warnings.
Link: http://lkml.kernel.org/r/20120313230302.GA1514@m.redhat.com
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a machine boots up, the TSC generally gets reset. However,
when kexec is used to boot into a kernel, the TSC value would be
carried over from the previous kernel. The computation of
cycns_offset in set_cyc2ns_scale is prone to an overflow, if the
machine has been up more than 208 days prior to the kexec. The
overflow happens when we multiply *scale, even though there is
enough room to store the final answer.
We fix this issue by decomposing tsc_now into the quotient and
remainder of division by CYC2NS_SCALE_FACTOR and then performing
the multiplication separately on the two components.
Refactor code to share the calculation with the previous
fix in __cycles_2_ns().
Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: john stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's a few awkward printk()s inside of scheduler guts that people
prefer to keep but really are rather deadlock prone. Fudge around it
by storing the text in a per-cpu buffer and poll it using the existing
printk_tick() handler.
This will drop output when its more frequent than once a tick, however
only the affinity thing could possible go that fast and for that just
one should suffice to notify the admin he's done something silly..
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-wua3lmkt3dg8nfts66o6brne@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a new thread handler is created, an irqaction is passed to it as
data. Not only that irqaction is stored in task_struct by the handler
for later use, but also a structure associated with the kernel thread
keeps this value as long as the thread exists.
This fix kicks irqaction out off task_struct. Yes, I introduce new bit
field. But it allows not only to eliminate the duplicate, but also
shortens size of task_struct.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120309135925.GB2114@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull networking from David Miller:
1) IPV4 routing metrics can become stale when routes are changed by the
administrator, fix from Steffen Klassert.
2) atl1c does "val |= XXX;" where XXX is a bit number not a bit mask,
fix by using set_bit. From Dan Carpenter.
3) Memory accounting bug in carl9170 driver results in wedged TX queue.
Fix from Nicolas Cavallari.
4) iwlwifi accidently uses "sizeof(ptr)" instead of "sizeof(*ptr)", fix
from Johannes Berg.
5) Openvswitch doesn't honor dp_ifindex when doing vport lookups, fix
from Ben Pfaff.
6) ehea conversion to 64-bit stats lost multicast and rx_errors
accounting, fix from Eric Dumazet.
7) Bridge state transition logging in br_stp_disable_port() is busted,
it's emitted at the wrong time and the message is in the wrong tense,
fix from Paulius Zaleckas.
8) mlx4 device erroneously invokes the queue resize firmware operation
twice, fix from Jack Morgenstein.
9) Fix deadlock in usbnet, need to drop lock when invoking usb_unlink_urb()
otherwise we recurse into taking it again. Fix from Sebastian Siewior.
10) hyperv network driver uses the wrong driver name string, fix from
Haiyang Zhang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net/hyperv: Use the built-in macro KBUILD_MODNAME for this driver
net/usbnet: avoid recursive locking in usbnet_stop()
route: Remove redirect_genid
inetpeer: Invalidate the inetpeer tree along with the routing cache
mlx4_core: fix bug in modify_cq wrapper for resize flow.
atl1c: set ATL1C_WORK_EVENT_RESET bit correctly
bridge: fix state reporting when port is disabled
bridge: br_log_state() s/entering/entered/
ehea: restore multicast and rx_errors fields
openvswitch: Fix checksum update for actions on UDP packets.
openvswitch: Honor dp_ifindex, when specified, for vport lookup by name.
iwlwifi: fix wowlan suspend
mwifiex: reset encryption mode flag before association
carl9170: fix frame delivery if sta is in powersave mode
carl9170: Fix memory accounting when sta is in power-save mode.
Fixes up a duplicate #include, adds an empty implementation of
of_find_compatible_node() and make git ignore .dtb files. And fix
up bus name on OF described PHYs. Nothing exciting here.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPWPvsAAoJEEFnBt12D9kB8QsP/2e6jv7CQ/wUhihzw126Y8yi
4HLG3ByJ382/c3LXg3juC+WfMvlpqEa0yqhikFKgIsm7cNbv7PkPS583pujdSXo0
IRykiT3Q9+XbO7L+9C88miYIQT8+IT/AjIjfuRwlP4gLPNUJhknpYhYOc09YzwOp
zhheeGVyeyTay+beQXip8gOEq2dYEl4IKsItagCBZfkDvj3Y8yDwTlP7f2j0FYZi
uODa3NFKE4uc0U2chtro0Vt87TRfbnIQ93SbvEWyQBWMEbPqT3l1Q94AeFGubCLj
RX910WvurCE4evzo2ZJzXPt9gPEyGIgtMGbjh+cENfT5/wNB2HWk1ftKnss5yEjp
WmcbwWKwXPwRPc5EpRdgabBCRgouCZghtcnJpkoXKSwbEt/nj3no8GOZXQnv+rSL
Ga0BSk1jJYC9DRpaIrLvdxXZF7vy1nug8fgU9ALi2R0gTmN+k6tHlLh60EWMSCBF
YCXv3fUd9IVJfOYsu4qMk0Pu40pW9rDc698Nxiep7C0iNhlddq8fiHTwB3DdoinC
iYmz+wEdDDp/68qY/mguXXtr5GAFSz1PUOwSJkh2zW6LnBDEv1djfhoA+5gwZX2v
I1IwfNHMn/XAEuhOFWMo9CejB0rDoBhigwucH8EilbIdBYVCCEtYaubO/Gq6HmId
h7hx9sWjrwF9HY0N9EHX
=/fxH
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull minor devicetree bug fixes and documentation updates from Grant Likely:
"Fixes up a duplicate #include, adds an empty implementation of
of_find_compatible_node() and make git ignore .dtb files. And fix up
bus name on OF described PHYs. Nothing exciting here."
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux-2.6:
doc: dt: Fix broken reference in gpio-leds documentation
of/mdio: fix fixed link bus name
of/fdt.c: asm/setup.h included twice
of: add picochip vendor prefix
dt: add empty of_find_compatible_node function
ARM: devicetree: Add .dtb files to arch/arm/boot/.gitignore
As we invalidate the inetpeer tree along with the routing cache now,
we don't need a genid to reset the redirect handling when the routing
cache is flushed.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We initialize the routing metrics with the values cached on the
inetpeer in rt_init_metrics(). So if we have the metrics cached on the
inetpeer, we ignore the user configured fib_metrics.
To fix this issue, we replace the old tree with a fresh initialized
inet_peer_base. The old tree is removed later with a delayed work queue.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull ARM updates from Russell King.
* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
ARM: 7358/1: perf: add PMU hotplug notifier
ARM: 7357/1: perf: fix overflow handling for xscale2 PMUs
ARM: 7356/1: perf: check that we have an event in the PMU IRQ handlers
ARM: 7355/1: perf: clear overflow flag when disabling counter on ARMv7 PMU
ARM: 7354/1: perf: limit sample_period to half max_period in non-sampling mode
ARM: ecard: ensure fake vma vm_flags is setup
ARM: 7346/1: errata: fix PL310 erratum #753970 workaround selection
ARM: 7345/1: errata: update workaround for A9 erratum #743622
ARM: 7348/1: arm/spear600: fix one-shot timer
ARM: 7339/1: amba/serial.h: Include types.h for resolving dependency of type bool
Merge the emailed seties of 19 patches from Andrew Morton
* akpm:
rapidio/tsi721: fix queue wrapping bug in inbound doorbell handler
memcg: fix mapcount check in move charge code for anonymous page
mm: thp: fix BUG on mm->nr_ptes
alpha: fix 32/64-bit bug in futex support
memcg: fix GPF when cgroup removal races with last exit
debugobjects: Fix selftest for static warnings
floppy/scsi: fix setting of BIO flags
memcg: fix deadlock by inverting lrucare nesting
drivers/rtc/rtc-r9701.c: fix crash in r9701_remove()
c2port: class_create() returns an ERR_PTR
pps: class_create() returns an ERR_PTR, not NULL
hung_task: fix the broken rcu_lock_break() logic
vfork: kill PF_STARTING
coredump_wait: don't call complete_vfork_done()
vfork: make it killable
vfork: introduce complete_vfork_done()
aio: wake up waiters when freeing unused kiocbs
kprobes: return proper error code from register_kprobe()
kmsg_dump: don't run on non-error paths by default
When moving tasks from old memcg (with move_charge_at_immigrate on new
memcg), followed by removal of old memcg, hit General Protection Fault in
mem_cgroup_lru_del_list() (called from release_pages called from
free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
exit_mmap from mmput from exit_mm from do_exit).
Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
still trying to update its stats, and take page off lru before freeing.
A task, or a charge, or a page on lru: each secures a memcg against
removal. In this case, the last task has been moved out of the old memcg,
and it is exiting: anonymous pages are uncharged one by one from the
memcg, as they are zapped from its pagetables, so the charge gets down to
0; but the pages themselves are queued in an mmu_gather for freeing.
Most of those pages will be on lru (and force_empty is careful to
lru_add_drain_all, to add pages from pagevec to lru first), but not
necessarily all: perhaps some have been isolated for page reclaim, perhaps
some isolated for other reasons. So, force_empty may find no task, no
charge and no page on lru, and let the removal proceed.
There would still be no problem if these pages were immediately freed; but
typically (and the put_page_testzero protocol demands it) they have to be
added back to lru before they are found freeable, then removed from lru
and freed. We don't see the issue when adding, because the
mem_cgroup_iter() loops keep their own reference to the memcg being
scanned; but when it comes to mem_cgroup_lru_del_list().
I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
38c5d72f3e ("memcg: simplify LRU handling by new rule") mercifully
removed those convolutions, but left this General Protection Fault.
But it's surprisingly easy to restore the old behaviour: just check
PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
change? just going back to how it worked before; testing, and an audit of
uses of pc->mem_cgroup, show no problem.
And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
longer has any purpose, and we can safely revert 4e5f01c2b9 ("memcg:
clear pc->mem_cgroup if necessary").
Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
is not strictly necessary: the lru_lock there, with RCU before memcg
structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
without that; but it seems cleaner to rely on one dependency less.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously it was (ab)used by utrace. Then it was wrongly used by the
scheduler code.
Currently it is not used, kill it before it finds the new erroneous user.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done(). zap_threads() should find and kill all tasks with
the same ->mm, this includes our parent if ->vfork_done is set.
mm_release() becomes the only caller, unexport complete_vfork_done().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make vfork() killable.
Change do_fork(CLONE_VFORK) to do wait_for_completion_killable(). If it
fails we do not return to the user-mode and never touch the memory shared
with our child.
However, in this case we should clear child->vfork_done before return, we
use task_lock() in do_fork()->wait_for_vfork_done() and
complete_vfork_done() to serialize with each other.
Note: now that we use task_lock() we don't really need completion, we
could turn task->vfork_done into "task_struct *wake_up_me" but this needs
some complications.
NOTE: this and the next patches do not affect in-kernel users of
CLONE_VFORK, kernel threads run with all signals ignored including
SIGKILL/SIGSTOP.
However this is obviously the user-visible change. Not only a fatal
signal can kill the vforking parent, a sub-thread can do execve or
exit_group() and kill the thread sleeping in vfork().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes.
Move the clear-and-complete-vfork_done code into the new trivial helper,
complete_vfork_done().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 04c6862c05 ("kmsg_dump: add kmsg_dump() calls to the
reboot, halt, poweroff and emergency_restart paths"), kmsg_dump() gets
run on normal paths including poweroff and reboot.
This is less than ideal given pstore implementations that can only
represent single backtraces, since a reboot may overwrite a stored oops
before it's been picked up by userspace. In addition, some pstore
backends may have low performance and provide a significant delay in
reboot as a result.
This patch adds a printk.always_kmsg_dump kernel parameter (which can also
be changed from userspace). Without it, the code will only be run on
failure paths rather than on normal paths. The option can be enabled in
environments where there's a desire to attempt to audit whether or not a
reboot was cleanly requested or not.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) TCP SACK processing can calculate an incorrect reordering value in
some cases, fix from Neal Cardwell.
2) tcp_mark_head_lost() can split SKBs in situations where it should
not, violating send queue invariants expected by other pieces of
code and thus resulting (eventually) in corrupted retransmit state
counters. Also from Neal Cardwell.
3) qla3xxx erroneously calls spin_lock_irqrestore() with constant
hw_flags of zero. Fix from Santosh Nayak.
4) Fix NULL deref in rt2x00, from Gabor Juhos.
5) pch_gbe passes address of wrong typed object to pch_gbe_validate_option
thus corrupting part of the value. From Dan Carpenter.
6) We must check the return value of nlmsg_parse() before trying to use
the results. From Eric Dumazet.
7) Bridging code fails to check return value of ipv6_dev_get_saddr()
thus potentially leaving uninitialized garbage in the outgoing ipv6
header. From Ulrich Weber.
8) Due to rounding and a reversed operation on jiffies, bridge message
ages can go backwards instead of forwards, thus breaking STP. Fixes
from Joakim Tjernlund.
9) r8169 modifies Config* registers without properly holding the
Config9346 lock, resulting in corrupted IP fragments on some chips.
Fix from Francois Romieu.
10) NET_PACKET_ENGINE default wan't set properly during the network
driver mega-move. Fix from Stephen Hemminger.
11) vmxnet3 uses TCP header size where it actually should use the UDP
header size, fix from Shreyas Bhatewara.
12) Netfilter bridge module autoload is busted in the compat case, fix
from Florian Westphal.
13) Wireless Key removal was not setting multicast bits correctly thus
accidently killing the unicast key 0 and thus all traffic stops.
Fix from Johannes Berg.
14) Fix endless retries of A-MPDU transmissions in brcm80211 driver.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
qla3xxx: ethernet: Fix bogus interrupt state flag.
bridge: check return value of ipv6_dev_get_saddr()
rtnetlink: fix rtnl_calcit() and rtnl_dump_ifinfo()
bridge: message age needs to increase, not decrease.
bridge: Adjust min age inc for HZ > 256
tcp: don't fragment SACKed skbs in tcp_mark_head_lost()
r8169: corrupted IP fragments fix for large mtu.
packetengines: fix config default
vmxnet3: Fix transport header size
enic: fix an endian bug in enic_probe()
pch_gbe: memory corruption calling pch_gbe_validate_option()
tg3: Fix tg3_get_stats64 for 5700 / 5701 devs
tcp: fix false reordering signal in tcp_shifted_skb
tcp: fix comment for tp->highest_sack
netfilter: bridge: fix module autoload in compat case
brcm80211: smac: only print block-ack timeout message at trace level
brcm80211: smac: fix endless retry of A-MPDU transmissions
mac80211: Fix a warning on changing to monitor mode from STA
mac80211: zero initialize count field in ieee80211_tx_rate
iwlwifi: fix key removal
...
Pull per-cpu patches from Tejun Heo:
"This pull request contains four patches. One replaces manual clearing
with bitmap_clear(), two fix generic definition of __this_cpu ops so
that they don't choose unnecessarily strict arch version. One makes
_this_cpu definition use raw_local_irq_*() so that it doesn't end up
wrecking irq on/off state tracking when used from inside lockdep.
Of the four patches, the raw_local_irq_*() update is the most
important, so please feel free to cherry pick only that one patch and
ignore the rest if you want to - commit e920d5971d 'percpu: use
raw_local_irq_* in _this_cpu op'."
* 'for-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: fix __this_cpu_{sub,inc,dec}_return() definition
percpu: use raw_local_irq_* in _this_cpu op
percpu: fix generic definition of __this_cpu_add_and_return()
percpu: use bitmap_clear
With branch stack sampling, it is possible to filter by priv levels.
In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.
We need a callback on context switch to either flush the branch stack
or save it. This patch adds a new callback in struct pmu which is called
during context switches. The callback is called only when necessary.
That is when a system-wide context has, at least, one event which
uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for
per-thread context.
In this version, the Intel x86 code simply flushes (resets) the LBR
on context switches (fills it with zeroes). Those zeroed branches are
then filtered out by the SW filter.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the ability to sample taken branches to the
perf_event interface.
The ability to capture taken branches is very useful for all
sorts of analysis. For instance, basic block profiling, call
counts, statistical call graph.
This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel x86 it is
implemented on top of the Last Branch Record (LBR) facility.
To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.
Sampled taken branches may be filtered by type and/or priv
levels.
The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.
Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.
The following generic filters are currently defined:
- PERF_SAMPLE_USER
only branches whose targets are at the user level
- PERF_SAMPLE_KERNEL
only branches whose targets are at the kernel level
- PERF_SAMPLE_HV
only branches whose targets are at the hypervisor level
- PERF_SAMPLE_ANY
any type of branches (subject to priv levels filters)
- PERF_SAMPLE_ANY_CALL
any call branches (may incl. syscall on some arch)
- PERF_SAMPLE_ANY_RET
any return branches (may incl. syscall returns on some arch)
- PERF_SAMPLE_IND_CALL
indirect call branches
Obviously filter may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event are used. It
is possible to collect branches at a priv level different from the
associated event. Use of kernel, hv priv levels is subject to permissions
and availability (hv).
The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, the executed code. Therefore
each sample contains the number of taken branches it contains.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's only used inside fs/dcache.c, and we're going to play games with it
for the word-at-a-time patches. This time we really don't even want to
export it, because it really is an internal function to fs/dcache.c, and
has been since it was introduced.
Having it in that extremely hot header file (it's included in pretty
much everything, thanks to <linux/fs.h>) is a disaster for testing
different versions, and is utterly pointless.
We really should have some kind of header file diet thing, where we
figure out which parts of header files are really better off private and
only result in more expensive compiles.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds missed "__" prefixes, otherwise these functions
works as irq/preemption safe.
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQEcBAABAgAGBQJPUqnkAAoJEDeqqVYsXL0MyTwH/14R9sIt+IdQgMGje6Dys6U1
3wxmNHcxNBB461212i+XVToH72GiPtUyWY8Q3LT+Qcq4Rab9oq8wDffxdgflDLja
sZq9COLR8dmXJM0hduyuysCiNIlmuURV20uIMsDbYutBVtAKKqMd7ZrJBmkVh7xT
slBJf+0lcD5IBYPYwf8lbxAI7E/C5TarMCIZk4z21I8ovF321RlGcyyo6NM62q/P
wvLw4bJuW7nJa3q3B7BznzBcoHLmzo9tiY+CbtQlGhSdasDKJ8HuAegTVyofl6s8
0/RTvaUcqTWUlFxXzLVDT+MCWAUP4GFR66tR2T5tygRi9st70TcpMzISiyZzKi8=
=WzAJ
-----END PGP SIGNATURE-----
Merge tag 'parisc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6
PARISC fixes from James Bottomley:
"This is a set of build fixes to get the cross compiled architecture
testbeds building again"
* tag 'parisc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6:
[PARISC] don't unconditionally override CROSS_COMPILE for 64 bit.
[PARISC] include <linux/prefetch.h> in drivers/parisc/iommu-helpers.h
[PARISC] fix compile break caused by iomap: make IOPORT/PCI mapping functions conditional
It did some odd things for unclear reasons. As this is one of the
functions that gets changed when doing word-at-a-time compares, this is
yet another of the "don't change any semantics, but clean things up so
that subsequent patches don't get obscured by the cleanups".
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. and also use it in lookup_one_len() rather than open-coding it.
There aren't any performance-critical users, so inlining it is silly.
But it wouldn't matter if it wasn't for the fact that the word-at-a-time
dentry name patches want to conditionally replace the function, and
uninlining it sets the stage for that.
So again, this is a preparatory patch that doesn't change any semantics,
and only prepares for a much cleaner and testable word-at-a-time dentry
name accessor patch.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These don't change any semantics, but they clean up the code a bit and
mark some arguments appropriately 'const'.
They came up as I was doing the word-at-a-time dcache name accessor
code, and cleaning this up now allows me to send out a smaller relevant
interesting patch for the experimental stuff.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is only one error code to return for a bad user-space buffer
pointer passed to a system call in the same address space as the
system call is executed, and that is EFAULT. Furthermore, the
low-level access routines, which catch most of the faults, return
EFAULT already.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The regset common infrastructure assumed that regsets would always
have .get and .set methods, but not necessarily .active methods.
Unfortunately people have since written regsets without .set methods.
Rather than putting in stub functions everywhere, handle regsets with
null .get or .set methods explicitly.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass nice as a value to proc_sched_autogroup_set_nice().
No side effect is expected, and the variable err will be overwritten with
the return value.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F45FBB7.5090607@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch (as1519) fixes a bug in the block layer's disk-events
polling. The polling is done by a work routine queued on the
system_nrt_wq workqueue. Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.
Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.
The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: <stable@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise KOBJ_CHANGE uevent.
However it ends up calling driver's revalidate_disk without open
and could cause oops.
In the case of SCSI:
process A process B
----------------------------------------------
sys_open
__blkdev_get
sd_open
returns -ENOMEDIUM
scsi_remove_device
<scsi_device torn down>
rescan_partitions
sd_revalidate_disk
<oops>
Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052
This patch separates the partition invalidation from rescan_partitions()
and use it for -ENOMEDIUM case.
Reported-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>