This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
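As a rough userspace illustration of the use case quoted above, the sketch below stores a service's main pid as an xattr on its cgroup directory and reads it back. The attribute name "trusted.main_pid" and both helper names are hypothetical, chosen for this example only; setxattr(2) and getxattr(2) are the standard Linux calls, and attributes in the "trusted." namespace can only be accessed by privileged processes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical attribute name, for illustration only. */
#define MAIN_PID_XATTR "trusted.main_pid"

/* Store @pid as a decimal string in an xattr on @cgroup_dir.
 * Returns 0 on success, -1 on failure (errno is set by setxattr). */
static inline int cgroup_store_main_pid(const char *cgroup_dir, pid_t pid)
{
	char buf[32];
	int len;

	assert(cgroup_dir != NULL);
	len = snprintf(buf, sizeof(buf), "%ld", (long)pid);
	return setxattr(cgroup_dir, MAIN_PID_XATTR, buf, (size_t)len, 0);
}

/* Read the pid back. Returns -1 if the attribute cannot be read. */
static inline long cgroup_load_main_pid(const char *cgroup_dir)
{
	char buf[32];
	ssize_t len;

	assert(cgroup_dir != NULL);
	len = getxattr(cgroup_dir, MAIN_PID_XATTR, buf, sizeof(buf) - 1);
	if (len < 0)
		return -1;
	buf[len] = '\0';
	return strtol(buf, NULL, 10);
}
```

A service manager could call cgroup_store_main_pid() when it starts a service and cgroup_load_main_pid() after being restarted, provided the hierarchy was mounted with xattr support enabled.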
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When remounting cgroupfs with some subsystems added to it and some
removed, cgroup removes all the files in the root directory and then
re-populates it.
This patch removes only the files which belong to subsystems that
are to be unbound, and creates files only for newly-added subsystems.
The purpose is to have all other files untouched.
This is a preparation for cgroup xattr support.
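The selection of which files to touch boils down to simple mask arithmetic. A minimal sketch, assuming each subsystem is one bit in an unsigned long mask; struct remount_plan and plan_remount() are hypothetical names for this example and not the functions changed by the patch:

```c
#include <assert.h>

/* Hypothetical result type, for illustration only. */
struct remount_plan {
	unsigned long removed;	/* subsystems being unbound: remove their files */
	unsigned long added;	/* newly bound subsystems: create their files */
	unsigned long kept;	/* bound before and after: leave files untouched */
};

static inline struct remount_plan plan_remount(unsigned long old_mask,
					       unsigned long new_mask)
{
	struct remount_plan p;

	p.removed = old_mask & ~new_mask;
	p.added = new_mask & ~old_mask;
	p.kept = old_mask & new_mask;
	/* A subsystem can never be both removed and added. */
	assert((p.removed & p.added) == 0);
	return p;
}
```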
v7:
- checkpatch warnings fixed
v6:
- no changes
v5:
- no changes
v4:
- refactored cgroup_clear_directory() to not use cgroup_rm_file()
- instead of going through the list of files, get the file list using the
subsystems
- use 'subsys_mask' instead of {added,removed}_bits and made
cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
v3:
- refresh patches after recent refactoring
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is an initial merge of Eric Biederman's work to start adding
user namespace support to networking.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull timer fixes from Thomas Gleixner:
"Mostly small fixes for the fallout of the timekeeping overhaul in 3.6
along with stable fixes to address an accumulation problem and missing
sanity checks for RTC readouts and user space provided values."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Avoid making adjustments if we haven't accumulated anything
time: Avoid potential shift overflow with large shift values
time: Fix casting issue in timekeeping_forward_now
time: Ensure we normalize the timekeeper in tk_xtime_add
time: Improve sanity checking of timekeeping inputs
The function graph tracer has a test to check if the frame pointer is
corrupted, which can happen with various gcc options when mcount is used.
But this is not an issue with -mfentry, as -mfentry neither needs nor uses
frame pointers for function graph tracing.
Link: http://lkml.kernel.org/r/20120807194059.773895870@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Thanks to Andi Kleen, gcc 4.6.0 now supports -mfentry for x86
(and hopefully soon for other archs). This makes the function
profiler hook run at the beginning of the function instead of
after the stack is set up. Plain -pg (mcount) is called after the
stack is set up, which in some cases can cause issues with the
function graph tracer. It also requires frame pointers to be
enabled.
With -mfentry, __fentry__ is called at the beginning of the function.
This allows for compiling without frame pointers and even gives the
ability to access parameters if needed.
If the architecture and the compiler both support -mfentry then
use that instead.
Link: http://lkml.kernel.org/r/20120807194059.392617243@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time. In particular, not if we
unplug CPUs to return to a single cpu.
Paul McKenney points out:
mean offline overhead is 6251/48=130.2 milliseconds.
If I remove the alternatives_smp_switch() from the offline
path [...] the mean offline overhead is 550/42=13.1 milliseconds
Basically, we're never going to get those 120ms back, and the
code is pretty messy.
We get rid of:
1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
documentation is wrong. It's now the default.
2) The skip_smp_alternatives flag used by suspend.
3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
which were only used to set this one flag.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paul.mckenney@us.ibm.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Noticed when digging into a suspend issue in linux-next (next-20120821).
For more details see <http://marc.info/?t=134554708000002&r=1&w=2>.
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If update_wall_time() is called and the current offset isn't large
enough to accumulate, avoid re-calling timekeeping_adjust which may
change the clock freq and can cause 1ns inconsistencies with
CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andreas Schwab noticed that the 1 << tk->shift could overflow if the
shift value was greater than 30, since 1 would be a 32bit long on
32bit architectures. This issue was introduced by 1e75fa8be (time:
Condense timekeeper.xtime into xtime_sec)
Use 1ULL instead to ensure we don't overflow on the shift.
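A minimal userspace sketch of the fix, assuming a hypothetical helper named shifted_one() (not a kernel function). The ULL suffix forces the shift to be carried out in 64 bits, so shift values of 31 and above no longer overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: returns 2^shift as a
 * 64-bit value. */
static inline uint64_t shifted_one(unsigned int shift)
{
	assert(shift < 64);
	return 1ULL << shift;
}
```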
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
arch_gettimeoffset returns a u32 value which when shifted by tk->shift
can overflow. This issue was introduced with 1e75fa8be (time: Condense
timekeeper.xtime into xtime_sec)
Cast it to u64 first.
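The effect of the cast can be shown with a small userspace sketch. Both helper names are hypothetical; the first models the old behaviour, where the shift is done in 32 bits and the high bits are lost, the second models the fixed one:

```c
#include <assert.h>
#include <stdint.h>

/* Shift done in 32 bits: bits shifted past bit 31 are discarded. */
static inline uint64_t shift_narrow(uint32_t offset, unsigned int shift)
{
	assert(shift < 32);
	return (uint32_t)(offset << shift);
}

/* Widen to 64 bits first, then shift: no bits are lost. */
static inline uint64_t shift_wide(uint32_t offset, unsigned int shift)
{
	assert(shift < 32);
	return (uint64_t)offset << shift;
}
```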
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andreas noticed problems with resume on specific hardware after commit
1e75fa8b (time: Condense timekeeper.xtime into xtime_sec) combined
with commit b44d50dca (time: Fix casting issue in tk_set_xtime and
tk_xtime_add)
After some digging I realized we aren't normalizing the timekeeper
after the add. Add the missing normalize call.
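What normalizing means here can be sketched in userspace as follows. This is a simplified model under stated assumptions: struct tk_model and both functions are hypothetical stand-ins, with nanoseconds stored left-shifted by 'shift' as the commit titles suggest, and it is not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_MODEL 1000000000ULL

/* Simplified model of the timekeeper's xtime fields. */
struct tk_model {
	uint64_t xtime_sec;	/* whole seconds */
	uint64_t xtime_nsec;	/* nanoseconds, left-shifted by 'shift' */
	unsigned int shift;
};

/* Carry whole seconds out of the shifted nanosecond field. */
static inline void tk_model_normalize(struct tk_model *tk)
{
	uint64_t one_sec = NSEC_PER_SEC_MODEL << tk->shift;

	while (tk->xtime_nsec >= one_sec) {
		tk->xtime_nsec -= one_sec;
		tk->xtime_sec++;
	}
}

static inline void tk_model_add(struct tk_model *tk, uint64_t sec,
				uint64_t nsec)
{
	assert(tk->shift < 32);
	tk->xtime_sec += sec;
	tk->xtime_nsec += nsec << tk->shift;
	tk_model_normalize(tk);	/* the step that was missing */
}
```

Without the final call, xtime_nsec could hold more than one second's worth of shifted nanoseconds after an add.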
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
cancel_delayed_work() can't be called from IRQ handlers due to its use
of del_timer_sync() and can't cancel work items which are already
transferred from timer to worklist.
Also, unlike other flush and cancel functions, a canceled delayed_work
would still point to the last associated cpu_workqueue. If the
workqueue is destroyed afterwards and the work item is re-used on a
different workqueue, the queueing code can oops trying to dereference
already freed cpu_workqueue.
This patch reimplements cancel_delayed_work() using
try_to_grab_pending() and set_work_cpu_and_clear_pending(). This
allows the function to be called from IRQ handlers and makes its
behavior consistent with other flush / cancel functions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Up to now, for delayed_works, try_to_grab_pending() couldn't be used
from IRQ handlers because IRQs may happen while
delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN.
This patch makes delayed_work use the new TIMER_IRQSAFE flag for
delayed_work->timer. This makes try_to_grab_pending() and thus
mod_delayed_work_on() safe to call from IRQ handlers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull audit-tree fixes from Miklos Szeredi:
"The audit subsystem maintainers (Al and Eric) are not responding to
repeated resends. Eric did ack them a while ago, but no response
since then. So I'm sending these directly to you."
* 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
audit: clean up refcounting in audit-tree
audit: fix refcounting in audit-tree
audit: don't free_chunk() after fsnotify_add_mark()
It seems commit 4a9d4b024a ("switch fput to task_work_add")
reintroduced the problem addressed in 944be0b224 ("close_files(): add
scheduling point").
If a server process with a lot of files (say 2 million tcp sockets) is
killed, we can spend a lot of time in task_work_run() and trigger a soft
lockup.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Timer internals are protected with irq-safe locks but timer execution
isn't, so a timer being dequeued for execution and its execution
aren't atomic against IRQs. This makes it impossible to wait for its
completion from IRQ handlers and difficult to shoot down a timer from
IRQ handlers.
This caused some issues for the delayed_work interface. Because
there's no way to reliably shoot down delayed_work->timer from IRQ
handlers, __cancel_delayed_work() can't share the logic to steal the
target delayed_work with cancel_delayed_work_sync(), and can only
steal delayed_works which are queued on timer. Similarly, the
pending mod_delayed_work() can't be used from IRQ handlers.
This patch adds a new timer flag TIMER_IRQSAFE, which makes the timer
execute without enabling IRQs after dequeueing, such that its
dequeueing and execution are atomic against IRQ handlers.
This makes it safe to wait for the timer's completion from IRQ
handlers, for example, using del_timer_sync(). It can never be
executing on the local CPU and if executing on other CPUs it won't be
interrupted until done.
This will enable simplifying delayed_work cancel/mod interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-5-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Over time, timer initializers became messy with unnecessarily
duplicated code which is inconsistently spread across timer.h and
timer.c.
This patch cleans up timer initializers.
* timer.c::__init_timer() is renamed to do_init_timer().
* __TIMER_INITIALIZER() added. It takes @flags and all initializers
are wrappers around it.
* init_timer[_on_stack]_key() now take @flags.
* __init_timer[_on_stack]() added. They take @flags and all init
macros are wrappers around them.
* __setup_timer[_on_stack]() added. It uses __init_timer() and takes
@flags. All setup macros are wrappers around the two.
Note that this patch doesn't add missing init/setup combinations -
e.g. init_timer_deferrable_on_stack(). Adding missing ones is
trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-4-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To prepare for addition of another flag, generalize timer->base flags
handling.
* Rename from TBASE_*_FLAG to TIMER_* and make them LU constants.
* Define and use TIMER_FLAG_MASK for flags masking so that multiple
flags can be handled correctly.
* Don't dereference timer->base directly even if
!tbase_get_deferrable(). Both such places already have @base
passed in, so use it instead.
* Make sure tvec_base's alignment is large enough for timer->base
flags using BUILD_BUG_ON().
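The idea of keeping flags in the low bits of the base pointer can be modelled in userspace roughly as below. This is a sketch under assumptions: the flag values, struct tvec_base_model and the three helpers are illustrative, TIMER_IRQSAFE stands in for the second flag this change prepares for, and _Static_assert plays the role of BUILD_BUG_ON():

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values; the low pointer bits carry the flags. */
#define TIMER_DEFERRABLE	0x1LU
#define TIMER_IRQSAFE		0x2LU
#define TIMER_FLAG_MASK		0x3LU

/* Stand-in for the per-CPU timer base; must be aligned so that the
 * bits covered by TIMER_FLAG_MASK are always zero in its address. */
struct tvec_base_model {
	_Alignas(4) int dummy;
};

_Static_assert((_Alignof(struct tvec_base_model) & TIMER_FLAG_MASK) == 0,
	       "base alignment too small for the flag bits");

static inline uintptr_t timer_pack_base(struct tvec_base_model *base,
					unsigned long flags)
{
	assert((flags & ~TIMER_FLAG_MASK) == 0);
	return (uintptr_t)base | flags;
}

static inline struct tvec_base_model *timer_get_base(uintptr_t packed)
{
	return (struct tvec_base_model *)(packed & ~(uintptr_t)TIMER_FLAG_MASK);
}

static inline unsigned long timer_get_flags(uintptr_t packed)
{
	return (unsigned long)(packed & TIMER_FLAG_MASK);
}
```

Masking with a single TIMER_FLAG_MASK, rather than testing one hard-coded bit, is what lets more than one flag coexist.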
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-2-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Export dummy_irq_chip to modules to allow them to do things such as
	irq_set_chip_and_handler(virq,
				 &dummy_irq_chip,
				 handle_level_irq);
This fixes
ERROR: "dummy_irq_chip" [drivers/gpio/gpio-pcf857x.ko] undefined!
when gpio-pcf857x.c is being built as a module.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Greg KH <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/871ujstrp6.wl%25kuninori.morimoto.gx@renesas.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Export irq_set_chip_and_handler_name() to modules to allow them to
do things such as
irq_set_chip_and_handler(....);
This fixes
ERROR: "irq_set_chip_and_handler_name" \
[drivers/gpio/gpio-pcf857x.ko] undefined!
when gpio-pcf857x.c is being built as a module.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Greg KH <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/873948trpk.wl%25kuninori.morimoto.gx@renesas.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=843640
If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path
does unmap_region() but does not remove the soon-to-be-freed vma
from rb tree. Actually there are more problems but this is how
William noticed this bug.
Perhaps we could do do_munmap() + return in this case, but in
fact it is simply wrong to abort if uprobe_mmap() fails, at
least until we move the !UPROBE_COPY_INSN code from
install_breakpoint() to uprobe_register().
For example, uprobe_mmap()->install_breakpoint() can fail if the
probed insn is not supported (remember, uprobe_register()
succeeds if nobody mmaps inode/offset), mmap() should not fail
in this case.
dup_mmap()->uprobe_mmap() is wrong too for the same reason:
fork() can race with uprobe_register() and fail for no reason if
it wins the race and does install_breakpoint() first.
And, if nothing else, both mmap_region() and dup_mmap() return
success if uprobe_mmap() fails. Change them to ignore the error
code from uprobe_mmap().
Reported-and-tested-by: William Cohen <wcohen@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v3.5
Cc: Anton Arapov <anton@redhat.com>
Cc: William Cohen <wcohen@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Fix include order for bison/flex-generated C files, from Ben Hutchings
* Build fixes and documentation corrections from David Ahern
* Group parsing support, from Jiri Olsa
* UI/gtk refactorings and improvements from Namhyung Kim
* NULL deref fix for perf script, from Namhyung Kim
* Assorted cleanups from Robert Richter
* Let O= makes handle relative paths, from Steven Rostedt
* perf script python fixes, from Feng Tang.
* Improve 'perf lock' error message when the needed tracepoints
are not present, from David Ahern.
* Initial bash completion support, from Frederic Weisbecker
* Allow building without libelf, from Namhyung Kim.
* Support DWARF CFI based unwind to have callchains when %bp
based unwinding is not possible, from Jiri Olsa.
* Symbol resolution fixes. While fixing support for PPC64 files with an .opd ELF
section was the end goal, several fixes for code that handles all
architectures, as well as cleanups, are included, from Cody Schafer.
* Add a description for the JIT interface, from Andi Kleen.
* Assorted fixes for Documentation and build in 32 bit, from Robert Richter
* Add support for non-tracepoint events in perf script python, from Feng Tang
* Cache the libtraceevent event_format associated with each evsel early, so that we
avoid relookups, i.e. calling pevent_find_event repeatedly when processing
tracepoint events.
[ This is to reduce the surface contact with libtraceevent and make clear what
it is that the perf tools need from that lib: so far, parsing the common and
per-event fields. ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull ftrace updates from Steve Rostedt:
" This patch series extends ftrace function tracing utility to be
more dynamic for its users. It allows for data passing to the callback
functions, as well as reading regs as if a breakpoint were to trigger
at function entry.
The main goal of this patch series was to allow kprobes to use ftrace
as an optimized probe point when a probe is placed on an ftrace nop.
With lots of help from Masami Hiramatsu, and going through lots of
iterations, we finally came up with a good solution. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
convert all users to system[_freezable]_wq.
If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
Please use system[_freezable]_wq instead.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Now that all workqueues are non-reentrant, system[_freezable]_wq() are
equivalent to system_nrt[_freezable]_wq(). Replace the latter with
wrappers around system[_freezable]_wq(). The wrapping goes through
inline functions so that __deprecated can be added easily.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
are equivalent to flush[_delayed]_work(). Drop the separate
implementation and make them thin wrappers around
flush[_delayed]_work().
* start_flush_work() no longer takes @wait_executing as the only remaining
user - flush_work() - always sets it to %true.
* __cancel_work_timer() uses flush_work() instead of wait_on_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs. The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.
* There's no sane usefulness in allowing a single work item to be
executed concurrently on multiple CPUs. People just get the
behavior unintentionally and get surprised after learning about it.
Most either explicitly synchronize or use non-reentrant/ordered
workqueue but this is error-prone.
* flush_work() can't wait for multiple instances of the same work item
on different CPUs. If a work item is executing on cpu0 and then
queued on cpu1, flush_work() can only wait for the one on cpu1.
Unfortunately, work items can easily cross CPU boundaries
unintentionally when the queueing thread gets migrated. This means
that if multiple queuers compete, flush_work() can't even guarantee
that the instance queued right before it is finished before
returning.
* flush_work_sync() was added to work around some of the deficiencies
of flush_work(). In addition to the usual flushing, it ensures that
all currently executing instances are finished before returning.
This operation is expensive as it has to walk all CPUs and at the
same time fails to address competing queuer case.
Incorrectly using flush_work() when flush_work_sync() is necessary
is an easy error to make and can lead to bugs which are difficult to
reproduce.
* Similar problems exist for flush_delayed_work[_sync]().
Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.
This patch makes all workqueues non-reentrant. If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU. This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.
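The queueing rule described above can be reduced to a tiny decision function. A minimal sketch, assuming a hypothetical NO_CPU marker and helper name; the real code finds the executing CPU through a busy_hash lookup rather than taking it as an argument:

```c
#include <assert.h>

/* Hypothetical marker: the work item is not executing anywhere. */
#define NO_CPU (-1)

/* Pick the CPU a work item is queued on. If the item is currently
 * executing somewhere, that CPU wins over the requested one, so the
 * item can never run on two CPUs at once. */
static inline int pick_target_cpu(int requested_cpu, int executing_cpu)
{
	assert(requested_cpu >= 0);
	if (executing_cpu != NO_CPU)
		return executing_cpu;
	return requested_cpu;
}
```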
The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on(). On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed. Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity. I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.
This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU. This shouldn't be noticeable
under any sane workload. Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.
While the overhead is measurable, it is only visible in pathological
cases and the difference isn't huge. This change brings much needed
sanity to workqueue and makes its behavior consistent with timer. I
think this is the right tradeoff to make.
This enables significant simplification of workqueue API.
Simplification patches will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixed some checkpatch warnings.
tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit.
Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix migration thread runtime bogosity
sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled
sched,cgroup: Fix up task_groups list
sched: fix divide by zero at {thread_group,task}_times
sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
The archs that implement virtual cputime accounting all
flush the cputime of a task when it gets descheduled
and sometimes set up some initial state for the
next task to account its cputime.
These archs all put their own hooks in their context
switch callbacks and handle the off-case themselves.
Consolidate this by creating a new account_switch_vtime()
callback, called in generic code right after a context switch,
which these archs must implement to flush the prev task's
cputime and initialize the next task's cputime-related state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Extract cputime code from the giant sched/core.c and
put it in its own file. This makes it easier to deal with
this particular area and de-bloats core.c a bit more.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Merge alpha architecture update from Michael Cree:
"The Alpha Maintainer, Matt Turner, is currently unavailable, so I have
collected up patches that have been posted to the linux-alpha mailing
list over the last couple of months, and are forwarding them to you in
the hope that you are prepared to accept them via me.
The patches by Al Viro and myself I have been running against kernels
for two months now so have had quite a bit of testing. All except one
patch were intended for the 3.5 kernel but because of Matt's
unavailability never got forwarded to you."
* emailed patches from Michael Cree <mcree@orcon.net.nz>: (9 commits)
alpha: Fix fall-out from disintegrating asm/system.h
Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts
alpha: fix fpu.h usage in userspace
alpha/mm/fault.c: Port OOM changes to do_page_fault
alpha: take kernel_execve() out of entry.S
alpha: take a bunch of syscalls into osf_sys.c
alpha: Use new generic strncpy_from_user() and strnlen_user()
alpha: Wire up cross memory attach syscalls
alpha: Don't export SOCK_NONBLOCK to user space.
New helper: current_thread_info(). It allows doing a bunch of odd syscalls
in C. While we are at it, there had never been a reason to do
osf_getpriority() in assembler. We also get "namespace"-aware (read:
consistent with getuid(2), etc.) behaviour from getx?id() syscalls now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Acked-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syscall_get_nr can return -1 in the case that the task is not executing
a system call.
This patch fixes perf_syscall_{enter,exit} to check that the syscall
number is valid before using it as an index into a bitmap.
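The kind of check involved can be sketched in userspace like this. The helper name and parameters are hypothetical, and the upper-bound test is an extra assumption of this sketch; the commit message itself only mentions the -1 case:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG_MODEL (sizeof(unsigned long) * CHAR_BIT)

/* Return true if syscall @nr is enabled in @bitmap. Invalid numbers,
 * including the -1 reported for a task that is not in a system call,
 * are rejected before they are used as an index. */
static inline bool syscall_bit_set(const unsigned long *bitmap,
				   int nr_syscalls, int nr)
{
	size_t idx;

	assert(bitmap != NULL);
	if (nr < 0 || nr >= nr_syscalls)
		return false;
	idx = (size_t)nr;
	return (bitmap[idx / BITS_PER_LONG_MODEL] >>
		(idx % BITS_PER_LONG_MODEL)) & 1UL;
}
```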
Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com
Cc: Jason Baron <jbaron@redhat.com>
Cc: Wade Farnsworth <wade_farnsworth@mentor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To speed up CPU down processing, use system_highpri_wq.
As the scheduling priority of its workers is higher than that of
system_wq's workers and it is not contended by other normal work items
on this CPU, work on it is processed faster than on system_wq.
tj: CPU up/downs care quite a bit about latency these days. This
shouldn't hurt anything and makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In rebind_workers(), we insert a work item to rebind busy workers to the CPU.
Currently, we use only system_wq for this. This can lead to an
error situation as there is a mismatch between cwq->pool and worker->pool.
To prevent this, use system_highpri_wq for highpri workers
so that the two match. This patch implements that.
tj: Rephrased comment a bit.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker pool
for HIGHPRI. When we handle busy workers for a gcwq, a worker can be a normal
worker or a highpri worker. But rebind_workers() doesn't consider this
difference and uses system_wq even for highpri workers. This creates a
mismatch between cwq->pool and worker->pool.
This doesn't cause an error in the current implementation, but could in the future.
Introduce system_highpri_wq so that the proper cwq can be used for highpri
workers in rebind_workers(). The following patch fixes this issue properly.
tj: Even apart from rebinding, having system_highpri_wq generally
makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We assign the cpu id into the work struct's data field in __queue_delayed_work_on().
In the current implementation, when a work item comes in for the first time,
the currently running cpu id is assigned.
If we call __queue_delayed_work_on() for CPU A while running on CPU B,
__queue_work() invoked from delayed_work_timer_fn() goes into
the following sub-optimal path in case of WQ_NON_REENTRANT.
	gcwq = get_gcwq(cpu);
	if (wq->flags & WQ_NON_REENTRANT &&
	    (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
Change lcpu to @cpu, and then change lcpu to the local cpu if lcpu is WORK_CPU_UNBOUND.
This is sufficient to avoid the sub-optimal path.
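The lcpu choice described above can be sketched as follows. This is an illustration only: pick_lcpu() is a hypothetical helper, and the sentinel value used here for WORK_CPU_UNBOUND is an assumption, not the kernel's definition:

```c
#include <assert.h>

/* Hypothetical sentinel; the kernel defines its own WORK_CPU_UNBOUND value. */
#define WORK_CPU_UNBOUND (-1)

/*
 * Pick the cpu to record as the work's last cpu: prefer the requested
 * cpu, and fall back to the local cpu only when no cpu was requested.
 */
static int pick_lcpu(int requested_cpu, int local_cpu)
{
	int lcpu = requested_cpu;

	assert(local_cpu >= 0);		/* the local cpu is always a real cpu */
	if (lcpu == WORK_CPU_UNBOUND)
		lcpu = local_cpu;
	return lcpu;
}
```

Because the recorded cpu now matches the cpu the timer will run on, the last_gcwq != gcwq comparison no longer trips for a first-time queueing.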
tj: Slightly rephrased the comment.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When we trace workqueue_queue_work(), it records the requested cpu.
But if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
the requested cpu is changed to the local cpu.
In the @wq->flags & WQ_UNBOUND case this change does not occur,
so it is reasonable to correct it.
Use a temporary local variable to store the requested cpu.
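A small sketch of the idea, assuming a trace record with separate fields for the requested and the effective cpu. The struct, the function name and the sentinel value are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

#define WORK_CPU_UNBOUND (-1)	/* hypothetical sentinel value */

struct queue_trace {
	int req_cpu;	/* cpu the caller asked for */
	int cpu;	/* cpu actually used for queueing */
};

static struct queue_trace queue_work_sketch(int cpu, int local_cpu,
					    bool wq_unbound)
{
	const int req_cpu = cpu;	/* remember the request before it is rewritten */
	struct queue_trace t;

	assert(local_cpu >= 0);
	if (!wq_unbound && cpu == WORK_CPU_UNBOUND)
		cpu = local_cpu;

	t.req_cpu = req_cpu;
	t.cpu = cpu;
	return t;
}
```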
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker_pool
for HIGHPRI. Although there is an NR_WORKER_POOLS enum value which represents
the number of pools, the definition of worker_pool in gcwq doesn't use it.
Using it makes the code more robust and prevents future mistakes.
So change the code to use this enum value.
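As an illustration of why the enum helps, here is a sketch in which the array size and every loop bound come from the same constant. The index names and struct names are hypothetical; only NR_WORKER_POOLS is taken from the text above:

```c
#include <assert.h>
#include <stddef.h>

enum {
	POOL_NORMAL,		/* hypothetical index names */
	POOL_HIGHPRI,
	NR_WORKER_POOLS		/* number of pools, follows the entries above */
};

struct worker_pool_sketch {
	int nr_workers;
};

struct gcwq_sketch {
	/* sized by the enum rather than by a hard-coded number */
	struct worker_pool_sketch pools[NR_WORKER_POOLS];
};

static int total_workers(const struct gcwq_sketch *gcwq)
{
	int sum = 0;

	assert(gcwq != NULL);
	for (int i = 0; i < NR_WORKER_POOLS; i++)
		sum += gcwq->pools[i].nr_workers;
	return sum;
}
```

Adding a third pool then only requires a new enum entry; the array and the loops follow automatically.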
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Unexpected behavior could occur if the time is set to a value large
enough to overflow a 64bit ktime_t (which is something later than the
year 2262).
Also unexpected behavior could occur if large negative offsets are
injected via adjtimex.
So this patch improves the sanity checking of timekeeping inputs by
improving the timespec_valid() check, and then makes better use of
timespec_valid() to make sure we don't set the time to an invalid
negative value or to one that overflows ktime_t.
Note: This does not protect from setting the time close to overflowing
ktime_t and then letting natural accumulation cause the overflow.
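A user-space sketch of the kind of check described, assuming ktime_t is a signed 64-bit nanosecond count. The struct, the function name and the KTIME_SEC_MAX_SKETCH macro are hypothetical and not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC		1000000000LL
/* Whole seconds that fit in a signed 64-bit nanosecond count. */
#define KTIME_SEC_MAX_SKETCH	(INT64_MAX / NSEC_PER_SEC)

struct timespec_sketch {
	int64_t tv_sec;
	int64_t tv_nsec;
};

static bool timespec_valid_sketch(const struct timespec_sketch *ts)
{
	assert(ts != NULL);
	if (ts->tv_sec < 0)			/* negative times are rejected */
		return false;
	if (ts->tv_nsec < 0 || ts->tv_nsec >= NSEC_PER_SEC)
		return false;			/* nanoseconds must be normalized */
	if (ts->tv_sec >= KTIME_SEC_MAX_SKETCH)
		return false;			/* would overflow the 64-bit count */
	return true;
}
```

As the note above says, a check like this only guards the value being set; it does not stop a nearly-overflowing time from overflowing later through normal accumulation.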
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Drop the initial reference taken by fsnotify_init_mark early, instead of in
audit_tree_freeing_mark() at destroy time.
In the cases where we destroy the mark before we drop the initial reference, we
need to get rid of the get_mark that balances the put_mark in
audit_tree_freeing_mark().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Refcounting of fsnotify_mark in audit tree is broken. E.g:
                                  refcount
    create_chunk
      alloc_chunk                 1
       fsnotify_add_mark          2
    untag_chunk
      fsnotify_get_mark           3
      fsnotify_destroy_mark
        audit_tree_freeing_mark   2
      fsnotify_put_mark           1
      fsnotify_put_mark           0
      via destroy_list
        fsnotify_mark_destroy    -1
This was reported by various people as triggering Oops when stopping auditd.
We could just remove the put_mark from audit_tree_freeing_mark() but that would
break freeing via inode destruction. So this patch simply omits a put_mark
after calling destroy_mark or adds a get_mark before.
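The refcount trace can be replayed with a plain counter. This is only a model of the sequence listed in this changelog; replay_untag() is a hypothetical helper and does not call any fsnotify code:

```c
#include <assert.h>
#include <stdbool.h>

/* Replay the reference count changes from the trace, step by step. */
static int replay_untag(bool omit_put_after_destroy)
{
	int ref = 0;

	ref += 1;	/* alloc_chunk: initial reference */
	ref += 1;	/* fsnotify_add_mark */
	ref += 1;	/* untag_chunk: fsnotify_get_mark */
	ref -= 1;	/* fsnotify_destroy_mark -> audit_tree_freeing_mark */
	ref -= 1;	/* fsnotify_put_mark */
	if (!omit_put_after_destroy)
		ref -= 1;	/* second fsnotify_put_mark, the one too many */
	ref -= 1;	/* destroy_list: fsnotify_mark_destroy */

	/* With one put omitted the count must never end up negative. */
	assert(ref >= 0 || !omit_put_after_destroy);
	return ref;
}
```

Calling replay_untag(false) reproduces the final -1 from the trace, while replay_untag(true) ends at 0.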
The additional get_mark is necessary where there's no other put_mark after
fsnotify_destroy_mark() since it assumes that the caller is holding a reference
(or the inode is keeping the mark pinned, not the case here AFAICS).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Valentin Avram <aval13@gmail.com>
Reported-by: Peter Moody <pmoody@google.com>
Acked-by: Eric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org
Don't do free_chunk() after fsnotify_add_mark(). That one does a delayed unref
via the destroy list and this results in use-after-free.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Eric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org
There is at least one modular user, so export free_pid_ns so that modules can
capture and use the pid namespace on the very rare occasion when it
makes sense.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Correct a long standing omission and use struct pid in the owner
field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS.
This guarantees we don't have issues when pid wraparound occurs.
Use a kuid_t in the owner field of struct ip6_flowlabel when the
share type is IPV6_FL_S_USER to add user namespace support.
In /proc/net/ip6_flowlabel capture the current pid namespace when
opening the file and release the pid namespace when the file is
closed, ensuring we print the pid owner value that is meaningful to
the reader of the file. Similarly use from_kuid_munged to print
uid values that are meaningful to the reader of the file.
This requires exporting pid_nr_ns so that ipv6 can continue to be built
as a module. Yoiks what silliness.
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Any operation which clears PENDING should be preceded by a wmb to
guarantee that the next PENDING owner sees all the changes made before
PENDING release.
There are only two places where PENDING is cleared -
set_work_cpu_and_clear_pending() and clear_work_data(). The caller of
the former already does smp_wmb() but the latter doesn't have any.
Move the wmb above set_work_cpu_and_clear_pending() into it and add
one to clear_work_data().
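The ordering requirement can be modelled in user space with C11 atomics. This is a sketch under assumptions: the struct, the bit value and clear_pending() are hypothetical, and a release fence stands in for smp_wmb() (it is somewhat stronger than a pure write barrier):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define WORK_PENDING_BIT 1UL	/* hypothetical bit value */

struct work_sketch {
	int payload;		/* state written while owning PENDING */
	atomic_ulong data;	/* flags word, carries the PENDING bit */
};

static void clear_pending(struct work_sketch *work, unsigned long new_data)
{
	assert(work != NULL);
	/*
	 * Order the owner's earlier writes before the store that drops
	 * PENDING, so the next owner sees them.
	 */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&work->data, new_data & ~WORK_PENDING_BIT,
			      memory_order_relaxed);
}
```

Putting the barrier inside the function that clears the bit means no caller can forget it, which mirrors moving the wmb into set_work_cpu_and_clear_pending().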
There hasn't been any report related to this issue, and, given how
clear_work_data() is used, it is extremely unlikely to have caused any
actual problems on any architecture.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
delayed_work encodes the workqueue to use and the last CPU in
delayed_work->work.data while it's on timer. The target CPU is
implicitly recorded as the CPU the timer is queued on and
delayed_work_timer_fn() queues delayed_work->work to the CPU it is
running on.
Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
out which CPU the delayed_work was queued for when they try to
re-queue after killing the timer. Currently, it chooses the local CPU
flush is running on. This can unexpectedly move a delayed_work queued
on a specific CPU to another CPU and lead to subtle errors.
There isn't much point in trying to save several bytes in struct
delayed_work, which is already close to a hundred bytes on 64bit with
all debug options turned off. This patch adds delayed_work->cpu to
remember the CPU it's queued for.
Note that if the timer is migrated during CPU down, the work item
could be queued to the downed global_cwq after this change. As a
detached global_cwq behaves like an unbound one, this doesn't change
much for the delayed_work.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Since the power saving code has been removed from the scheduler, the code that
implemented it in this function is dead, and it even pollutes other logic:
for example, 'want_sd' never has a chance to be set to '0', which removes the
effect of SD_WAKE_AFFINE here.
So clean up the obsolete code, including SD_PREFER_LOCAL.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As we already have dst_rq in lb_env, using or changing "this_rq" does not
make sense.
This patch replaces "this_rq" with dst_rq in load_balance(), so we
no longer need to change "this_rq" while processing LBF_SOME_PINNED.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/501F8357.3070102@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds a comment on top of the schedule() function to explain
to scheduler newbies how the main scheduler function is entered.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Explained-by: Ingo Molnar <mingo@kernel.org>
Explained-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It should be sched_nr_latency so fix it before it annoys me more.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344435364-18632-1-git-send-email-bp@amd64.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make the stop scheduler class do the same accounting as other classes.
Migration threads can be caught in the act while doing exec balancing,
leading to the below due to use of unmaintained ->se.exec_start. The
load that triggered this particular instance was an apparently out of
control heavily threaded application that does system monitoring in
what equated to an exec bomb, with one of the VERY frequently migrated
tasks being ps.
  %CPU   PID USER     CMD
  99.3    45 root     [migration/10]
  97.7    53 root     [migration/12]
  97.0    57 root     [migration/13]
  90.1    49 root     [migration/11]
  89.6    65 root     [migration/15]
  88.7    17 root     [migration/3]
  80.4    37 root     [migration/8]
  78.1    41 root     [migration/9]
  44.2    13 root     [migration/2]
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With multiple instances of task_groups, for_each_rt_rq() is a noop,
no task groups having been added to the rt.c list instance. This
renders __enable/disable_runtime() and print_rt_stats() noop, the
user (non) visible effect being that rt task groups are missing in
/proc/sched_debug.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: stable@kernel.org # v3.3+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On architectures where cputime_t is a 64 bit type, it is possible to trigger
a divide by zero on the do_div(temp, (__force u32) total) line, if total is a
non zero number but has its lower 32 bits zeroed. Removing the cast is not
a good solution since some do_div() implementations do cast to u32
internally.
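The truncation is easy to reproduce in user space. In this sketch narrowed_divisor() shows what a divide routine sees once the divisor is narrowed to 32 bits, and scale_utime_sketch() divides by the full 64-bit value instead. Both helpers are hypothetical illustrations, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* What a divide routine sees if it narrows the divisor to 32 bits. */
static uint32_t narrowed_divisor(uint64_t total)
{
	return (uint32_t)total;
}

/* Scale with a full 64-bit divisor, so any non-zero total is safe. */
static uint64_t scale_utime_sketch(uint64_t utime, uint64_t rtime,
				   uint64_t total)
{
	assert(total != 0);
	/* Overflow of the product is ignored in this sketch. */
	return (rtime * utime) / total;
}
```

For total = 0x100000000 the narrowed divisor is 0 even though total itself is non-zero, which is exactly the case that raises the divide error.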
This problem can be triggered in practice on very long lived processes:
PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
#0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
#1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
#2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
#3 [ffff880472a51cd0] die at ffffffff8100f26b
#4 [ffff880472a51d00] do_trap at ffffffff814f03f4
#5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
#6 [ffff880472a51e00] divide_error at ffffffff8100be7b
[exception RIP: thread_group_times+0x56]
RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
#8 [ffff880472a51f40] sys_times at ffffffff81088524
#9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b
Cc: stable@vger.kernel.org
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Portante reported that for large cgroup hierarchies (and/or on
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.
His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.
It was found that:
  schedule()
    idle_balance()
      load_balance()
        local_irq_save()
        double_rq_lock()
        update_h_load()
          walk_tg_tree(tg_load_down)
            tg_load_down()
Results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance and since new-idle balance isn't throttled this
results in a lot of work while holding the rq->lock.
This patch does two things. First, it removes the work from under rq->lock,
based on the good principle of race and pray which is widely employed
in the load-balancer as a whole. Secondly, it throttles the
update_h_load() calculation to at most once per jiffy.
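The once-per-jiffy throttle can be sketched with a timestamp stored next to the data it protects. The struct and field names are hypothetical, and the counter merely stands in for the hierarchy walk:

```c
#include <assert.h>
#include <stddef.h>

struct rq_sketch {
	unsigned long h_load_throttle;	/* tick of the last update */
	unsigned long nr_walks;		/* counts walks, for illustration only */
};

static void update_h_load_sketch(struct rq_sketch *rq, unsigned long now)
{
	assert(rq != NULL);
	if (rq->h_load_throttle == now)
		return;			/* already updated during this tick */
	rq->h_load_throttle = now;
	rq->nr_walks++;			/* stand-in for the cgroup hierarchy walk */
}
```

However many new-idle balance passes happen within one tick, the expensive walk then runs at most once for that tick.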
I considered excluding update_h_load() for new-idle balance
altogether, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).
Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Peter Portante <pportant@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU
kthread code to use the new smp_hotplug_thread facility.
[ tglx: Adapted it to use callbacks and to the simplified rcu yield ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.673354828@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.563736676@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.456416747@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Because kernel subsystems need their per-CPU kthreads on UP systems as
well as on SMP systems, the smpboot hotplug kthread functions must be
provided in UP builds as well as in SMP builds. This commit therefore
adds smpboot.c to UP builds and excludes irrelevant code via #ifdef.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Provide a generic interface for setting up and tearing down percpu
threads.
On registration the threads for already online cpus are created and
started. On deregistration (modules) the threads are stopped.
During hotplug operations the threads are created, started, parked and
unparked. The data structure for registration provides a pointer to
percpu storage space and optional setup, cleanup, park, unpark
functions. These functions are called when the thread state changes.
Each implementation has to provide a function which is queried and
returns whether the thread should run and the thread function itself.
The core code handles all state transitions and avoids duplicated code
in the call sites.
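A user-space model of the idea may help: a descriptor carries the callbacks, and a single core routine drives every state transition. All names below (percpu_thread_desc, core_step(), the demo_* client) are hypothetical and do not reproduce the kernel interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct percpu_thread_desc {
	bool (*should_run)(unsigned int cpu);	/* queried by the core */
	void (*thread_fn)(unsigned int cpu);	/* the actual work */
	void (*setup)(unsigned int cpu);	/* optional callbacks below */
	void (*cleanup)(unsigned int cpu);
	void (*park)(unsigned int cpu);
	void (*unpark)(unsigned int cpu);
};

enum thread_state { T_NONE, T_ACTIVE, T_PARKED };

/* Core code: perform one state transition, then run the thread if asked. */
static enum thread_state core_step(const struct percpu_thread_desc *d,
				   unsigned int cpu,
				   enum thread_state cur,
				   enum thread_state want)
{
	assert(d != NULL && d->should_run != NULL && d->thread_fn != NULL);

	if (cur == T_NONE && want == T_ACTIVE && d->setup)
		d->setup(cpu);
	else if (cur == T_ACTIVE && want == T_PARKED && d->park)
		d->park(cpu);
	else if (cur == T_PARKED && want == T_ACTIVE && d->unpark)
		d->unpark(cpu);
	else if (cur != T_NONE && want == T_NONE && d->cleanup)
		d->cleanup(cpu);

	if (want == T_ACTIVE && d->should_run(cpu))
		d->thread_fn(cpu);
	return want;
}

/* A tiny client: it only supplies callbacks, no state handling of its own. */
static unsigned int demo_runs, demo_parks;

static bool demo_should_run(unsigned int cpu) { (void)cpu; return true; }
static void demo_fn(unsigned int cpu) { (void)cpu; demo_runs++; }
static void demo_park(unsigned int cpu) { (void)cpu; demo_parks++; }

static const struct percpu_thread_desc demo_desc = {
	.should_run = demo_should_run,
	.thread_fn = demo_fn,
	.park = demo_park,
};
```

The client never decides when to park or unpark; that logic lives in one place, which is the duplication the changelog says the core code removes from the call sites.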
[ paulmck: Preemption leak fix ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.352501068@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To avoid the full teardown/setup of per cpu kthreads in the case of
cpu hot(un)plug, provide a facility which allows putting the kthread
into a parked position and unparking it when the cpu comes online again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120716103948.236618824@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The rcu_yield() code is amazing. It's there to avoid starvation of the
system when lots of (boosting) work is to be done.
Now looking at the code, its functionality is:
- Make the thread SCHED_OTHER and very nice, i.e. get it out of the way
- Arm a timer with 2 ticks
- schedule()
Now if the system goes idle the rcu task returns, regains SCHED_FIFO
and plugs on. If the system stays busy the timer fires and wakes a
per node kthread which in turn makes the per cpu thread SCHED_FIFO and
brings it back on the cpu. For the boosting thread the "make it FIFO"
bit is missing and it just runs some magic boost checks. Now this is a
lot of code with extra threads and complexity.
It's way simpler to let the tasks, when they detect overload, schedule
away for 2 ticks and defer the normal wakeup as long as they are in the
yielded state and the cpu is not idle.
That solves the same problem and the only difference is that when the
cpu goes idle it's not guaranteed that the thread returns right away,
but it won't be out for longer than two ticks, so no harm is done. If
that's an issue then it is way simpler just to wake the task from
idle as RCU has callbacks there anyway.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120716103948.131256723@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael J. Wysocki:
- Fix for two recent regressions in the generic PM domains framework.
- Revert of a commit that introduced a resume regression and is
conceptually incorrect in my opinion.
- Fix for a return value in pcc-cpufreq.c from Julia Lawall.
- RTC wakeup signaling fix from Neil Brown.
- Suppression of compiler warnings for CONFIG_PM_SLEEP unset in ACPI,
platform/x86 and TPM drivers.
* tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
tpm_tis / PM: Fix unused function warning for CONFIG_PM_SLEEP
platform / x86 / PM: Fix unused function warnings for CONFIG_PM_SLEEP
ACPI / PM: Fix unused function warnings for CONFIG_PM_SLEEP
Revert "NMI watchdog: fix for lockup detector breakage on resume"
PM: Make dev_pm_get_subsys_data() always return 0 on success
drivers/cpufreq/pcc-cpufreq.c: fix error return code
RTC: Avoid races between RTC alarm wakeup and suspend.
While tracking down a weird buffer overflow issue in a program that
looked to be sane, I started double checking the length returned by
syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing
the buffer.
Sure enough, it was. I saw this in strace:
11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279
It turns out that the loops that calculate how much space the entries
will take when they're copied don't include the newlines and prefixes
that will be included in the final output since prev flags is passed as
zero.
This patch properly accounts for it and fixes the overflow.
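A simplified model of the accounting: each record must be sized as it will be printed, with its prefix and trailing newline, before deciding how many records fit. The helpers below are hypothetical and use a fixed prefix length, which the real code does not:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes one record occupies in the output: prefix + text + newline. */
static size_t record_out_len(const char *text, size_t prefix_len)
{
	assert(text != NULL);
	return prefix_len + strlen(text) + 1;
}

/*
 * Return the index of the oldest record such that records [index, n)
 * fit into bufsize bytes when each is sized as it will be printed.
 */
static size_t first_record_that_fits(const char *const *recs, size_t n,
				     size_t prefix_len, size_t bufsize)
{
	size_t total = 0;
	size_t start = n;

	while (start > 0) {
		size_t need = record_out_len(recs[start - 1], prefix_len);

		if (total + need > bufsize)
			break;
		total += need;
		start--;
	}
	return start;
}
```

If the sizing pass counted only the text bytes, three 4-byte records would appear to fit in 16 bytes, yet the copy pass would emit 24 bytes with a 3-byte prefix and a newline each.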
CC: stable@kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introducing the following bits to the perf_event_attr struct:
- exclude_callchain_kernel to filter out kernel callchain
from the sample dump
- exclude_callchain_user to filter out user callchain
from the sample dump
We need to be able to disable the standard user callchain dump when we use
the dwarf cfi callchain mode, because frame pointer based user
callchains are useless in this mode.
Also implementing exclude_callchain_kernel to have a complete set of
options.
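A sketch of how the two bits could gate the callchain parts. The attr_sketch and callchain_plan types and plan_callchain() are hypothetical; only the two bit names come from the text above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct attr_sketch {
	unsigned int exclude_callchain_kernel : 1;
	unsigned int exclude_callchain_user   : 1;
};

struct callchain_plan {
	bool kernel;	/* record the kernel part of the callchain */
	bool user;	/* record the user part of the callchain */
};

static struct callchain_plan plan_callchain(const struct attr_sketch *attr)
{
	struct callchain_plan p;

	assert(attr != NULL);
	p.kernel = !attr->exclude_callchain_kernel;
	p.user = !attr->exclude_callchain_user;
	return p;
}
```

With exclude_callchain_user set, a dwarf cfi based tool would keep the kernel callchain and unwind the user part itself.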
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
[ Added kernel callchains filtering ]
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-7-git-send-email-jolsa@redhat.com
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump
of the user level stack on sample. The size of the dump is specified by
sample_stack_user value.
Being able to dump parts of the user stack, starting from the stack
pointer, will be useful for post mortem dwarf CFI based stack
unwinding.
Added HAVE_PERF_USER_STACK_DUMP config option to determine if the
architecture provides user stack dump on perf event samples. This needs
access to the user stack pointer which is not unified across
architectures. Enabling this for x86 architecture.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-6-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Introducing perf_output_skip function to be able to skip data within the
perf ring buffer.
When writing data into the perf ring buffer we first reserve the needed space
in the ring buffer and then copy the actual data.
There's a possibility we won't be able to fill all the reserved size
with data, so we need a way to skip the remaining bytes.
This is going to be useful when storing the user stack dump, where we
might end up with less data than we originally requested.
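The reserve, copy, skip pattern can be sketched on a flat buffer. This model ignores wrap-around and paging; the handle type and both helpers are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct out_handle_sketch {
	unsigned char *buf;
	size_t size;
	size_t pos;	/* next byte to write */
};

/* Copy up to len bytes into the buffer; return how many were copied. */
static size_t out_copy(struct out_handle_sketch *h, const void *src,
		       size_t len)
{
	assert(h != NULL && h->pos <= h->size);
	if (len > h->size - h->pos)
		len = h->size - h->pos;
	memcpy(h->buf + h->pos, src, len);
	h->pos += len;
	return len;
}

/* Advance past len reserved bytes without writing to them. */
static void out_skip(struct out_handle_sketch *h, size_t len)
{
	assert(h != NULL && h->pos <= h->size);
	if (len > h->size - h->pos)
		len = h->size - h->pos;
	h->pos += len;
}
```

A writer that reserved 8 bytes but obtained only 5 would call out_skip() for the remaining 3, so the next field still starts at the offset the reader expects.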
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-5-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adding a generic way to use the __output_copy function with a specific copy
function via the DEFINE_PERF_OUTPUT_COPY macro.
Using this to add a new __output_copy_user function that provides output
copy from user pointers. For x86 the copy_from_user_nmi function is used
and __copy_from_user_inatomic for the rest of the architectures.
This new function will be used in the user stack dump on sample, coming in
the next patches.
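The macro technique can be illustrated in user space: one macro stamps out a copy loop, parameterised by the low-level copy function. Everything here is a hypothetical sketch (the macro name, the chunk size, both copy helpers); the helpers follow the copy_from_user() convention of returning the number of bytes not copied:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4	/* hypothetical chunk size, stands in for a buffer page */

/* Copy helpers return the number of bytes NOT copied. */
static size_t copy_plain(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

static size_t copy_always_faults(void *dst, const void *src, size_t n)
{
	(void)dst;
	(void)src;
	return n;	/* simulates an inaccessible source */
}

/* Generate a chunked copy routine around the given copy function. */
#define DEFINE_OUTPUT_COPY_SKETCH(func_name, copy_func)			\
static size_t func_name(unsigned char *dst, const unsigned char *src,	\
			size_t len)					\
{									\
	assert(dst != NULL && src != NULL);				\
	while (len) {							\
		size_t n = len < CHUNK_SIZE ? len : CHUNK_SIZE;		\
		size_t left = copy_func(dst, src, n);			\
		size_t done = n - left;					\
									\
		dst += done;						\
		src += done;						\
		len -= done;						\
		if (left)						\
			break;	/* short copy: stop and report */	\
	}								\
	return len;	/* bytes left uncopied */			\
}

DEFINE_OUTPUT_COPY_SKETCH(output_copy_plain, copy_plain)
DEFINE_OUTPUT_COPY_SKETCH(output_copy_faulting, copy_always_faults)
```

The loop is written once, and each instantiation differs only in the primitive it calls, which is what lets a user-pointer variant reuse the same output logic.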
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-4-git-send-email-jolsa@redhat.com
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Introducing PERF_SAMPLE_REGS_USER sample type bit to trigger the dump of
user level registers on sample. Registers we want to dump are specified
by sample_regs_user bitmask.
Only user level registers are dumped at the moment, that is, the register
values of the user space context as they were before the user entered the
kernel for whatever reason (syscall, irq, exception, or a PMI happening
in userspace).
The layout of the sample_regs_user bitmap is described in
asm/perf_regs.h for archs that support register dump.
This is going to be useful to bring Dwarf CFI based stack unwinding on
top of samples.
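Selecting registers by bitmask can be sketched as follows. The register array layout and dump_selected_regs() are hypothetical; the real bit assignments are the ones described in asm/perf_regs.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy the registers whose bit is set in mask to out, in ascending
 * bit order, and return how many values were written.
 */
static size_t dump_selected_regs(const uint64_t *regs, size_t nr_regs,
				 uint64_t mask, uint64_t *out)
{
	size_t n = 0;

	assert(regs != NULL && out != NULL && nr_regs <= 64);
	for (size_t bit = 0; bit < nr_regs; bit++) {
		if (mask & (UINT64_C(1) << bit))
			out[n++] = regs[bit];
	}
	return n;
}
```

A consumer that knows the mask can map each dumped value back to its register by walking the set bits in the same order.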
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Dump registers ABI specification. ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Suggested-by: Stephane Eranian <eranian@google.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-3-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Revert commit 45226e9 (NMI watchdog: fix for lockup detector breakage
on resume) which breaks resume from system suspend on my SH7372
Mackerel board (by causing a NULL pointer dereference to happen) and
is generally wrong, because it abuses the CPU hotplug functionality
in a shamelessly blatant way.
The original issue should be addressed through appropriate syscore
resume callback instead.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Add missing initialization for ret variable. Its initialization
is based on the re_cnt variable, which is being set deep down
in the ftrace_function_filter_re function.
I'm not sure compilers will be smart enough to see this in the near
future, so kill the warning this way.
Link: http://lkml.kernel.org/r/1340120894-9465-2-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The wakeup_rt self test used msleep() calls to wait for real time
tasks to wake up and run. On bare-metal hardware, this was enough as
the scheduler should let the RT task run way before the non-RT task
wakes up from the msleep(). If it did not, then that would mean the
scheduler was broken.
But when dealing with virtual machines, this is a different story.
If the RT task wakes up on a VCPU, it's up to the host to decide when
that task gets to schedule, which can be far behind the time that the
non-RT task wakes up. In this case, the test would fail incorrectly.
As we are not testing the scheduler, but instead the wake up tracing,
we can use completions to wait and not depend on scheduler timings
to see if events happen on time.
Link: http://lkml.kernel.org/r/1343663105.3847.7.camel@fedora
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Tetsuo Handa reported that sporadically the system clock starts
counting up too quickly, which is enough to confuse the hangcheck
timer into printing a bogus stall warning.
Commit 2a8c0883 "time: Move xtime_nsec adjustment underflow handling
timekeeping_adjust" overlooked this exit path:
} else
return;
which should really be a proper exit sequence, fixing the bug as a
side effect.
Also make the flow more readable by properly balancing curly
braces.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: john.stultz@linaro.org
Cc: a.p.zijlstra@chello.nl
Cc: richardcochran@gmail.com
Cc: prarit@redhat.com
Link: http://lkml.kernel.org/r/20120804192114.GA28347@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull futex fixes from Ingo Molnar:
"A couple of futex fixes from Darren Hart: two bugs reported by Dave
Jones (found with his trinity test) and Dan Carpenter through static
analysis. The third found while debugging the first two."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi()
futex: Fix bug in WARN_ON for NULL q.pi_state
futex: Test for pi_mutex on fault in futex_wait_requeue_pi()
Pull timer fixes from Ingo Molnar:
"One regression fix, and a couple of cleanups that clean up the code
flow in areas that had high-profile bugs recently."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Remove all direct references to timekeeper
time: Clean up offs_real/wall_to_mono and offs_boot/total_sleep_time updates
time: Clean up stray newlines
time/jiffies: Rename ACTHZ to SHIFTED_HZ
time/jiffies: Allow CLOCK_TICK_RATE to be undefined
time: Fix casting issue in tk_set_xtime and tk_xtime_add
Pull scheduler fixes from Ingo Molnar:
"Fixes and two late cleanups"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cleanups: Add load balance cpumask pointer to 'struct lb_env'
sched: Fix comment about PREEMPT_ACTIVE bit location
sched: Fix minor code style issues
sched: Use task_rq_unlock() in __sched_setscheduler()
sched/numa: Add SD_PERFER_SIBLING to CPU domain
Pull perf fixes from Ingo Molnar:
"Fix merge window fallout and fix sleep profiling (this was always
broken, so it's not a fix for the merge window - we can skip this one
from the head of the tree)."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/trace: Add ability to set a target task for events
perf/x86: Fix USER/KERNEL tagging of samples properly
perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bit
Pull irq fix from Ingo Molnar.
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Allow irq chips to mark themself oneshot safe
Merge tag 'for_linux-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull KGDB/KDB/usb-dbgp fixes and cleanups from Jason Wessel:
"There are no new features, those will be delayed to the 3.7 window.
There are only fixes/cleanup against the usual kernel churn and we are
removing more lines than we add:
- usb-dbgp - increase the controller wait time to come out of halt.
- kdb - Remove unused KDB_FLAG_ONLY_DO_DUMP code and cpu in more prompt
- debug core - pass NMI type on archs that provide NMI types"
* tag 'for_linux-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
USB: echi-dbgp: increase the controller wait time to come out of halt.
kernel/debug: Make use of KGDB_REASON_NMI
kdb: Remove cpu from the more prompt
kdb: Remove unused KDB_FLAG_ONLY_DO_DUMP
Workqueue was lacking a mechanism to modify the timeout of an already
pending delayed_work. delayed_work users have been working around
this using several methods - using an explicit timer + work item,
messing directly with delayed_work->timer, and canceling before
re-queueing, all of which are error-prone and/or ugly.
This patch implements mod_delayed_work[_on]() which behaves similarly
to mod_timer() - if the delayed_work is idle, it's queued with the
given delay; otherwise, its timeout is modified to the new value.
Zero @delay guarantees immediate execution.
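
The semantics described above can be sketched in userspace C. This is
only a toy model under stated assumptions: struct toy_delayed_work and
toy_mod_delayed_work() are hypothetical names, not the kernel API, and
the return convention (true if the item was already pending) is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct delayed_work; not the kernel type. */
struct toy_delayed_work {
	bool pending;		/* a timer is armed or the item is queued */
	bool queued_now;	/* queued for immediate execution */
	unsigned long expires;	/* absolute expiry in ticks */
};

/*
 * mod_timer()-like behaviour: (re)arm the item so that it fires @delay
 * ticks after @now, whether or not it was already pending.
 *
 * Returns true if the item was pending and its timeout was modified,
 * false if it was idle and has been newly queued.
 */
static bool toy_mod_delayed_work(struct toy_delayed_work *dw,
				 unsigned long now, unsigned long delay)
{
	bool was_pending;

	assert(dw);
	was_pending = dw->pending;

	dw->pending = true;
	/* Zero delay bypasses the timer: run as soon as possible. */
	dw->queued_now = (delay == 0);
	dw->expires = now + delay;

	return was_pending;
}
```

A caller that previously had to cancel and re-queue can, in this model,
issue a single call and inspect the return value if it cares whether the
item was already pending.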
v2: Updated to reflect try_to_grab_pending() changes. Now safe to be
called from bh context.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
There can be two reasons try_to_grab_pending() can fail with -EAGAIN.
One is when someone else is queueing or dequeueing the work item. With
the previous patches, it is guaranteed that PENDING and queued state
will soon agree making it safe to busy-retry in this case.
The other is if multiple __cancel_work_timer() invocations are racing
one another. __cancel_work_timer() grabs PENDING and then waits for
running instances of the target work item on all CPUs while holding
PENDING and !queued. try_to_grab_pending() invoked from another task
will keep returning -EAGAIN while the current owner is waiting.
Not distinguishing the two cases is okay because __cancel_work_timer()
is the only user of try_to_grab_pending() and it invokes
wait_on_work() whenever grabbing fails. For the first case, busy
looping should be fine but wait_on_work() doesn't cause any critical
problem. For the latter case, the new contender usually waits for the
same condition as the current owner, so no unnecessarily extended
busy-looping happens. Combined, these make __cancel_work_timer()
technically correct even without irq protection while grabbing PENDING
or distinguishing the two different cases.
While the current code is technically correct, not distinguishing the
two cases makes it difficult to use try_to_grab_pending() for other
purposes than canceling because it's impossible to tell whether it's
safe to busy-retry grabbing.
This patch adds a mechanism to mark a work item being canceled.
try_to_grab_pending() now disables irq on success and returns -EAGAIN
to indicate that grabbing failed but PENDING and queued states are
gonna agree soon and it's safe to busy-loop. It returns -ENOENT if
the work item is being canceled and it may stay PENDING && !queued for
an arbitrary amount of time.
__cancel_work_timer() is modified to mark the work canceling with
WORK_OFFQ_CANCELING after grabbing PENDING, thus making
try_to_grab_pending() fail with -ENOENT instead of -EAGAIN. Also, it
invokes wait_on_work() iff grabbing failed with -ENOENT. This isn't
necessary for correctness but makes it consistent with other future
users of try_to_grab_pending().
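
The distinction between the two failure codes can be illustrated with a
small userspace model. Everything here is hypothetical (struct toy_work,
toy_try_to_grab_pending(), toy_next_action()); the success return values
0 and 1 are an assumption of the sketch, and no locking or irq handling
is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy work item: three booleans instead of bits in work->data. */
struct toy_work {
	bool pending;	/* PENDING bit */
	bool queued;	/* sitting on a worklist */
	bool canceling;	/* stands in for WORK_OFFQ_CANCELING */
};

/*
 * Returns 0 if the item was idle and PENDING was claimed, 1 if it was
 * queued and has been stolen off the queue, -EAGAIN if PENDING and
 * queued disagree only transiently, -ENOENT if another task is
 * canceling the item and may hold PENDING for a long time.
 */
static int toy_try_to_grab_pending(struct toy_work *w)
{
	assert(w);
	if (!w->pending) {
		w->pending = true;
		return 0;
	}
	if (w->queued) {
		w->queued = false;	/* dequeue, keep PENDING */
		return 1;
	}
	return w->canceling ? -ENOENT : -EAGAIN;
}

enum toy_action { TOY_PROCEED, TOY_BUSY_RETRY, TOY_WAIT_FOR_CANCEL };

/* What a caller may reasonably do for each outcome. */
static enum toy_action toy_next_action(int ret)
{
	if (ret >= 0)
		return TOY_PROCEED;
	return ret == -EAGAIN ? TOY_BUSY_RETRY : TOY_WAIT_FOR_CANCEL;
}
```

With the two codes separated, a non-canceling user can busy-retry on
-EAGAIN and choose some other strategy on -ENOENT instead of always
sleeping.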
v2: try_to_grab_pending() was testing preempt_count() to ensure that
the caller has disabled preemption. This triggers spuriously if
!CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by
Fengguang Wu.
v3: Updated so that try_to_grab_pending() disables irq on success
rather than requiring preemption disabled by the caller. This
makes busy-looping easier and will allow try_to_grab_pending() to
be used from bh/irq contexts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
* Use bool @is_dwork instead of @timer and let try_to_grab_pending()
use to_delayed_work() to determine the delayed_work address.
* Move timer handling from __cancel_work_timer() to
try_to_grab_pending().
* Make try_to_grab_pending() use -EAGAIN instead of -1 for
busy-looping and drop the ret local variable.
* Add proper function comment to try_to_grab_pending().
This makes the code a bit easier to understand and will ease further
changes. This patch doesn't make any functional change.
v2: Use @is_dwork instead of @timer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain
WORK_STRUCT_FLAG_* and flush color. If the work item is queued, the
rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise,
WORK_STRUCT_CWQ is clear and the bits contain the last CPU number -
either a real CPU number or one of WORK_CPU_*.
Scheduled addition of mod_delayed_work[_on]() requires an additional
flag, which is used only while a work item is off queue. There are
more than enough bits to represent off-queue CPU number on both 32 and
64bits. This patch introduces WORK_OFFQ_FLAG_* which occupy the lower
part of the @work->data high bits while off queue. This patch doesn't
define any actual OFFQ flag yet.
Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds
the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to
make room for OFFQ flags.
To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong
cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON()
to check that there are enough bits to accommodate off-queue CPU number
is added.
This patch doesn't make any functional difference.
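
The off-queue layout can be sketched with plain bit arithmetic. The
TOY_* names and the chosen widths (8 struct flag bits, 1 OFFQ flag bit,
16 CPU bits) are illustrative assumptions, not the kernel's actual
values, which depend on configuration.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative widths only. */
#define TOY_STRUCT_FLAG_BITS	8	/* WORK_STRUCT_FLAG_* and flush color */
#define TOY_OFFQ_FLAG_BITS	1	/* room reserved for off-queue flags */
#define TOY_MAX_CPU_BITS	16	/* assumed upper bound for a CPU number */

/* OFFQ flags sit right above the struct flags; the CPU sits above both. */
#define TOY_OFFQ_FLAG_BASE	TOY_STRUCT_FLAG_BITS
#define TOY_OFFQ_CPU_SHIFT	(TOY_OFFQ_FLAG_BASE + TOY_OFFQ_FLAG_BITS)
#define TOY_STRUCT_FLAG_MASK	((1UL << TOY_STRUCT_FLAG_BITS) - 1)
#define TOY_OFFQ_FLAG_MASK \
	(((1UL << TOY_OFFQ_FLAG_BITS) - 1) << TOY_OFFQ_FLAG_BASE)

/* Compile-time check in the spirit of BUILD_BUG_ON(). */
static_assert(TOY_OFFQ_CPU_SHIFT + TOY_MAX_CPU_BITS <=
	      sizeof(unsigned long) * CHAR_BIT,
	      "not enough bits for the off-queue CPU number");

/* Pack CPU number, off-queue flags and struct flags into one word. */
static unsigned long toy_offq_pack(unsigned long cpu,
				   unsigned long offq_flags,
				   unsigned long struct_flags)
{
	return (cpu << TOY_OFFQ_CPU_SHIFT) |
	       ((offq_flags << TOY_OFFQ_FLAG_BASE) & TOY_OFFQ_FLAG_MASK) |
	       (struct_flags & TOY_STRUCT_FLAG_MASK);
}

static unsigned long toy_offq_cpu(unsigned long data)
{
	return data >> TOY_OFFQ_CPU_SHIFT;
}

static unsigned long toy_offq_flags(unsigned long data)
{
	return (data & TOY_OFFQ_FLAG_MASK) >> TOY_OFFQ_FLAG_BASE;
}
```

Growing TOY_OFFQ_FLAG_BITS moves the CPU field up automatically, and the
static assertion fails the build if the CPU number no longer fits.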
Signed-off-by: Tejun Heo <tj@kernel.org>
try_to_grab_pending() will be used by to-be-implemented
mod_delayed_work[_on](). Move try_to_grab_pending() and related
functions above queueing functions.
This patch only moves functions around.
Signed-off-by: Tejun Heo <tj@kernel.org>
If @delay is zero and the delayed_work is idle, queue_delayed_work()
queues it for immediate execution; however, queue_delayed_work_on()
lacks this logic and always goes through timer regardless of @delay.
This patch moves 0 @delay handling logic from queue_delayed_work() to
queue_delayed_work_on() so that both functions behave the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Queueing functions have been using different methods to determine the
local CPU.
* queue_work() superfluously uses get/put_cpu() to acquire and hold the
local CPU across queue_work_on().
* delayed_work_timer_fn() uses smp_processor_id().
* queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu
which is interpreted as the local CPU.
* flush_delayed_work[_sync]() were using raw_smp_processor_id().
* __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the
target workqueue is a bound one but nobody uses this.
This patch converts all functions to uniformly use %WORK_CPU_UNBOUND
to indicate local CPU and use the local binding feature of
__queue_work(). unlikely() is dropped from %WORK_CPU_UNBOUND handling
in __queue_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
delayed_work->timer.function is currently initialized during
queue_delayed_work_on(). Export delayed_work_timer_fn() and set
delayed_work timer function during delayed_work initialization
together with other fields.
This ensures the timer function is always valid on an initialized
delayed_work. This is to help mod_delayed_work() implementation.
To detect delayed_work users which diddle with the internal timer,
trigger WARN if timer function doesn't match on queue.
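
A rough userspace illustration of the idea follows. The toy_* names are
hypothetical, the timer is reduced to a bare function pointer, and a
counter stands in for the kernel's WARN.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_delayed_work;
typedef void (*toy_timer_fn_t)(struct toy_delayed_work *);

/* Toy delayed work: the "timer" is just a callback pointer. */
struct toy_delayed_work {
	toy_timer_fn_t timer_fn;
	bool pending;
};

/* Counts how often the mismatch check fired (stand-in for WARN). */
static unsigned int toy_warnings;

/* The one legitimate timer callback. */
static void toy_delayed_work_timer_fn(struct toy_delayed_work *dw)
{
	dw->pending = false;
}

/* A callback installed by a user poking at the internal timer. */
static void toy_rogue_timer_fn(struct toy_delayed_work *dw)
{
	(void)dw;
}

/* The timer function is set once, at initialization time. */
static void toy_init_delayed_work(struct toy_delayed_work *dw)
{
	assert(dw);
	dw->pending = false;
	dw->timer_fn = toy_delayed_work_timer_fn;
}

/* Queueing no longer sets the callback; it only verifies it. */
static bool toy_queue_delayed_work(struct toy_delayed_work *dw)
{
	if (dw->timer_fn != toy_delayed_work_timer_fn)
		toy_warnings++;
	if (dw->pending)
		return false;
	dw->pending = true;
	return true;
}
```

Because the callback is valid from initialization onwards, code that
re-arms the timer of an idle item does not need to go through the
queueing path first.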
Signed-off-by: Tejun Heo <tj@kernel.org>
Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access
to the target work item. They first try to claim the bit and proceed
with queueing only after that succeeds and there's a window between
PENDING being set and the actual queueing where the task can be
interrupted or preempted.
There's also a similar window in process_one_work() when clearing
PENDING. A work item is dequeued, gcwq->lock is released and then
PENDING is cleared and the worker might get interrupted or preempted
between releasing gcwq->lock and clearing PENDING.
cancel[_delayed]_work_sync() tries to claim or steal PENDING. The
function assumes that a work item with PENDING is either queued or in
the process of being [de]queued. In the latter case, it busy-loops
until either the work item loses PENDING or is queued. If canceling
coincides with the above described interrupts or preemptions, the
canceling task will busy-loop while the queueing or executing task is
preempted.
This patch keeps irq disabled across claiming PENDING and actual
queueing and moves PENDING clearing in process_one_work() inside
gcwq->lock so that busy looping from PENDING && !queued doesn't wait
for interrupted/preempted tasks. Note that, in process_one_work(),
setting last CPU and clearing PENDING got merged into single
operation.
This removes possible long busy-loops and will allow using
try_to_grab_pending() from bh and irq contexts.
v2: __queue_work() was testing preempt_count() to ensure that the
caller has disabled preemption. This triggers spuriously if
!CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by
Fengguang Wu.
v3: Disable irq instead of preemption. IRQ will be disabled while
grabbing gcwq->lock later anyway and this allows using
try_to_grab_pending() from bh and irq contexts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
WORK_STRUCT_PENDING is used to claim ownership of a work item and
process_one_work() releases it before starting execution. When
someone else grabs PENDING, all pre-release updates to the work item
should be visible and all updates made by the new owner should happen
afterwards.
Grabbing PENDING uses test_and_set_bit() and thus has a full barrier;
however, clearing doesn't have a matching wmb. Given the preceding
spin_unlock and use of clear_bit, I don't believe this can be a
problem on an actual machine and there hasn't been any related report
but it still is theoretically possible for clear_pending to permeate
upwards and happen before work->entry update.
Add an explicit smp_wmb() before work_clear_pending().
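
For readers more familiar with C11 atomics, the ownership protocol can
be approximated as below. This is an analogy, not the kernel code: the
toy_* names are hypothetical, atomic_fetch_or() plays the role of
test_and_set_bit(), and a release-ordered clear is used as a rough
(slightly stronger) counterpart of smp_wmb() followed by clearing the
bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_PENDING	1UL

struct toy_work {
	atomic_ulong data;	/* bit 0 plays the role of PENDING */
	int payload;		/* state only the PENDING owner may touch */
};

/*
 * Claim ownership. Like test_and_set_bit(), the sequentially
 * consistent read-modify-write acts as a full barrier.
 */
static bool toy_claim_pending(struct toy_work *w)
{
	unsigned long old = atomic_fetch_or(&w->data, TOY_PENDING);

	return !(old & TOY_PENDING);
}

/*
 * Release ownership. The release ordering keeps the owner's earlier
 * writes from being reordered after the clear, so the next claimant
 * observes them.
 */
static void toy_clear_pending(struct toy_work *w)
{
	assert(atomic_load(&w->data) & TOY_PENDING);
	atomic_fetch_and_explicit(&w->data, ~TOY_PENDING,
				  memory_order_release);
}
```

Without the ordering on the clearing side, a new owner could in
principle claim the bit and still see stale item state, which is the
theoretical hole described above.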
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
All queueing functions return 1 on success, 0 if the work item was
already pending. Update them to return bool instead. This signifies
better that they don't return 0 / -errno.
This is cleanup and doesn't cause any functional difference.
While at it, fix comment opening for schedule_work_on().
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, queue/schedule[_delayed]_work_on() are located below the
counterpart without the _on postfix even though the latter is usually
implemented using the former. Swap them.
This is cleanup and doesn't cause any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
__ptrace_may_access() is used within only kernel/ptrace.c.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.
Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."
Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull irqdomain changes from Grant Likely:
"Round of refactoring and enhancements to irq_domain infrastructure.
This series starts the process of simplifying irqdomain. The ultimate
goal is to merge LEGACY, LINEAR and TREE mappings into a single
system, but had to back off from that after some last minute bugs.
Instead it mainly reorganizes the code and ensures that the reverse
map gets populated when the irq is mapped instead of the first time it
is looked up.
Merging of the irq_domain types is deferred to v3.7
In other news, this series adds helpers for creating static mappings
on a linear or tree mapping."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
irqdomain: Improve diagnostics when a domain mapping fails
irqdomain: eliminate slow-path revmap lookups
irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
irqdomain: Eliminate dedicated radix lookup functions
irqdomain: Support for static IRQ mapping and association.
irqdomain: Always update revmap when setting up a virq
irqdomain: Split disassociating code into separate function
irq_domain: correct a minor wrong comment for linear revmap
irq_domain: Standardise legacy/linear domain selection
irqdomain: Make ops->map hook optional
irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
irqdomain: Simple NUMA awareness.
devicetree: add helper inline for retrieving a node's full name
Merge Andrew's second set of patches:
- MM
- a few random fixes
- a couple of RTC leftovers
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
rtc/rtc-88pm80x: remove unneed devm_kfree
rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
tmpfs: distribute interleave better across nodes
mm: remove redundant initialization
mm: warn if pg_data_t isn't initialized with zero
mips: zero out pg_data_t when it's allocated
memcg: gix memory accounting scalability in shrink_page_list
mm/sparse: remove index_init_lock
mm/sparse: more checks on mem_section number
mm/sparse: optimize sparse_index_alloc
memcg: add mem_cgroup_from_css() helper
memcg: further prevent OOM with too many dirty pages
memcg: prevent OOM with too many dirty pages
mm: mmu_notifier: fix freed page still mapped in secondary MMU
mm: memcg: only check anon swapin page charges for swap cache
mm: memcg: only check swap cache pages for repeated charging
mm: memcg: split swapin charge function into private and public part
mm: memcg: remove needless !mm fixup to init_mm when charging
mm: memcg: remove unneeded shmem charge type
...
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random subsystem patches from Ted Ts'o:
"This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.
The goal is to address weaknesses discussed in the paper "Mining
your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman,
which will be published in the Proceedings of the 21st Usenix Security
Symposium, August 2012. (See https://factorable.net for more
information and an extended version of the paper.)"
Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
random: mix in architectural randomness in extract_buf()
dmi: Feed DMI table to /dev/random driver
random: Add comment to random_initialize()
random: final removal of IRQF_SAMPLE_RANDOM
um: remove IRQF_SAMPLE_RANDOM which is now a no-op
sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
[ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
...
This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.
Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle with
- thus the gfp to alloc flag mapping ignores the task flags when in
interrupts (hard or soft) context.
Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. This patch borrows the task flags from whatever process happens
to be preempted by the softirq. It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modifies the
softirq code to save, clear and restore the PF_MEMALLOC flag.
The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag
cannot leak back into the preempted process. This should be safe for
the following reasons:
Softirqs can run on multiple CPUs sure but the same task should not be
executing the same softirq code. Neither should the softirq
handler be preempted by any other softirq handler so the flags
should not leak to an unrelated softirq.
Softirqs re-enable hardware interrupts in __do_softirq() so can be
preempted by hardware interrupts so PF_MEMALLOC is inherited
by the hard IRQ. However, this is similar to a process in
reclaim being preempted by a hardirq. While PF_MEMALLOC is
set, gfp_to_alloc_flags() distinguishes between hard and
soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
flag.
If the softirq is deferred to ksoftirqd then its flags may be used
instead of a normal task's but as the softirq cannot be preempted,
the PF_MEMALLOC flag does not leak to other code by accident.
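
The save, clear and restore sequence can be sketched in userspace as
follows. The toy_* names, the flag value and the handler signature are
assumptions made for illustration; the real softirq path is more
involved.

```c
#include <assert.h>

#define TOY_PF_MEMALLOC	0x0800u	/* illustrative bit value */

struct toy_task {
	unsigned int flags;
};

typedef void (*toy_handler_t)(struct toy_task *);

/* Flags as observed by the recording handler below. */
static unsigned int toy_seen_flags;

/* Example handler that sets the flag while it runs. */
static void toy_handler_sets_memalloc(struct toy_task *t)
{
	t->flags |= TOY_PF_MEMALLOC;
}

/* Example handler that records the flags it was entered with. */
static void toy_handler_records(struct toy_task *t)
{
	toy_seen_flags = t->flags;
}

/* Restore only the bits in @mask from @saved; leave the rest alone. */
static void toy_restore_flags(struct toy_task *t, unsigned int saved,
			      unsigned int mask)
{
	t->flags &= ~mask;
	t->flags |= saved & mask;
}

/* Run @handler on borrowed task flags without leaking PF_MEMALLOC. */
static void toy_do_softirq(struct toy_task *curr, toy_handler_t handler)
{
	unsigned int old_flags;

	assert(curr && handler);
	old_flags = curr->flags;		/* save */
	curr->flags &= ~TOY_PF_MEMALLOC;	/* clear */
	handler(curr);				/* may set the flag */
	toy_restore_flags(curr, old_flags, TOY_PF_MEMALLOC); /* restore */
}
```

In this model a handler never starts with the preempted task's
PF_MEMALLOC, and whatever it sets is gone again once the handler
returns.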
[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node. There's code to try
to achieve that in hotadd_new_pgdat() as below:
	/*
	 * The node we allocated has no zone fallback lists. For avoiding
	 * to access not-initialized zonelist, build here.
	 */
	mutex_lock(&zonelists_mutex);
	build_all_zonelists(pgdat, NULL);
	mutex_unlock(&zonelists_mutex);
But it doesn't work as expected. When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet. And build_all_zonelists() only builds zonelists for
online nodes as:
	for_each_online_node(nid) {
		pg_data_t *pgdat = NODE_DATA(nid);
		build_zonelists(pgdat);
		build_zonelist_cache(pgdat);
	}
Although we intend to create a zonelist for the new pgdat, none gets
built. So add a new parameter "pgdat" to build_all_zonelists() so that
zonelists are built for the new pgdat too.
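
The problem and the fix can be modelled with a few lines of userspace C.
The toy_* names and the boolean "zonelists_built" marker are
hypothetical simplifications; no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES	4

struct toy_pgdat {
	bool online;
	bool zonelists_built;
};

static struct toy_pgdat toy_nodes[TOY_MAX_NODES];

/*
 * Build zonelists for every online node and, if given, for @new_pgdat,
 * which is still offline while it is being hot-added.
 */
static void toy_build_all_zonelists(struct toy_pgdat *new_pgdat)
{
	size_t nid;

	if (new_pgdat)
		new_pgdat->zonelists_built = true;

	for (nid = 0; nid < TOY_MAX_NODES; nid++) {
		if (toy_nodes[nid].online)
			toy_nodes[nid].zonelists_built = true;
	}
}

/* Hot-add path: the node is not marked online yet when this runs. */
static struct toy_pgdat *toy_hotadd_new_pgdat(size_t nid)
{
	struct toy_pgdat *pgdat;

	assert(nid < TOY_MAX_NODES);
	pgdat = &toy_nodes[nid];
	pgdat->online = false;
	pgdat->zonelists_built = false;
	toy_build_all_zonelists(pgdat);
	return pgdat;
}
```

Passing NULL instead of the new pgdat reproduces the old behaviour in
this model: the online-only loop skips the node being added.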
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more. But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.
For backward compatibility, print a warning and return 2 to notify
users that the interface has been removed.
Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vm_stat_account() currently accounts shared_vm, stack_vm and reserved_vm.
But we can also account for total_vm in vm_stat_account(), which makes
the code tidier.
Even for mprotect_fixup(), we can get the right result in the end.
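
Why the mprotect_fixup() case still comes out right can be seen in a
simplified model. The toy_* names and flag values are assumptions for
illustration, and only a subset of the counters is shown.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_VM_GROWSDOWN	0x00100ul	/* illustrative values */
#define TOY_VM_RESERVED		0x80000ul

struct toy_mm {
	long total_vm;
	long shared_vm;
	long stack_vm;
	long reserved_vm;
};

/* One helper updates total_vm together with the per-type counters. */
static void toy_vm_stat_account(struct toy_mm *mm, unsigned long flags,
				bool file_backed, long pages)
{
	assert(mm);
	mm->total_vm += pages;
	if (file_backed)
		mm->shared_vm += pages;
	else if (flags & TOY_VM_GROWSDOWN)
		mm->stack_vm += pages;
	if (flags & TOY_VM_RESERVED)
		mm->reserved_vm += pages;
}

/*
 * A protection change un-accounts the range under the old flags and
 * re-accounts it under the new ones. The two total_vm updates cancel.
 */
static void toy_mprotect_fixup(struct toy_mm *mm, unsigned long oldflags,
			       unsigned long newflags, bool file_backed,
			       long nrpages)
{
	toy_vm_stat_account(mm, oldflags, file_backed, -nrpages);
	toy_vm_stat_account(mm, newflags, file_backed, nrpages);
}
```

The range is subtracted and added back with the same page count, so
total_vm is unchanged while the per-type counters follow the new flags.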
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull perf updates from Ingo Molnar:
"The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
fixes) now that Arnaldo is back from vacation."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
uprobes: __replace_page() needs munlock_vma_page()
uprobes: Rename vma_address() and make it return "unsigned long"
uprobes: Fix register_for_each_vma()->vma_address() check
uprobes: Introduce vaddr_to_offset(vma, vaddr)
uprobes: Teach build_probe_list() to consider the range
uprobes: Remove insert_vm_struct()->uprobe_mmap()
uprobes: Remove copy_vma()->uprobe_mmap()
uprobes: Fix overflow in vma_address()/find_active_uprobe()
uprobes: Suppress uprobe_munmap() from mmput()
uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
uprobes: Clean up and document write_opcode()->lock_page(old_page)
uprobes: Kill write_opcode()->lock_page(new_page)
uprobes: __replace_page() should not use page_address_in_vma()
uprobes: Don't recheck vma/f_mapping in write_opcode()
perf/x86: Fix missing struct before structure name
perf/x86: Fix format definition of SNB-EP uncore QPI box
perf/x86: Make bitfield unsigned
perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
perf/x86: Add Intel Nehalem-EX uncore support
perf/x86: Fix typo in format definition of uncore PCU filter
...
Ingo noted that the numerous timekeeper.value references made
the timekeeping code ugly and caused many long lines that
had to be broken up. He recommended replacing timekeeper.value
references with tk->value.
This patch provides a local tk value for all top level time
functions and sets it to &timekeeper. Then all timekeeper
access is done via a tk pointer.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For performance reasons, we maintain ktime_t based duplicates of
wall_to_monotonic (offs_real) and total_sleep_time (offs_boot).
Since large problems could occur (such as the resume regression
on 3.5-rc7, or the leapsecond hrtimer issue) if these value
pairs were to be inconsistently updated, this patch cleans
up how we modify these value pairs to ensure we are always
consistent.
As a side-effect this is also more efficient as we only
calculate the duplicate values when they are changed,
rather than on every update_wall_time call.
This also provides WARN_ONs to detect if future changes break
the invariants.
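
The pattern of funnelling every update of a value pair through one
setter can be sketched as below. The toy_* names are hypothetical, plain
nanosecond integers replace timespec and ktime_t, and the relation
offs_real == -wall_to_monotonic is stated here as an assumption of the
sketch. An assert() stands in for the WARN_ON.

```c
#include <assert.h>
#include <stdint.h>

struct toy_timekeeper {
	int64_t wall_to_monotonic_ns;	/* canonical value */
	int64_t offs_real_ns;		/* cached duplicate, negated */
};

/* The only place allowed to modify the pair. */
static void toy_tk_set_wall_to_mono(struct toy_timekeeper *tk,
				    int64_t wtm_ns)
{
	/* The pair must already be consistent when we get here. */
	assert(tk->offs_real_ns == -tk->wall_to_monotonic_ns);

	tk->wall_to_monotonic_ns = wtm_ns;
	tk->offs_real_ns = -wtm_ns;	/* recomputed only on change */
}

/* Hot path: uses the cached duplicate, no recomputation needed. */
static int64_t toy_mono_to_real(const struct toy_timekeeper *tk,
				int64_t mono_ns)
{
	return mono_ns + tk->offs_real_ns;
}
```

Since the duplicate is derived in exactly one function, a later change
that writes either field directly would trip the check the next time the
setter runs.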
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-5-git-send-email-john.stultz@linaro.org
[ Cleaned up minor style issues. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo noted that ACTHZ is a confusing name, and requested it
be renamed, so this patch renames ACTHZ to SHIFTED_HZ to
better describe it.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A few events are interesting not only for the current task.
For example, sched_stat_* events are interesting for the task
which wakes up. For this reason, it would be good if such
events were delivered to a target task too.
Now a target task can be set by using __perf_task().
The original idea and a draft patch belong to Peter Zijlstra.
I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Inspired-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With this patch struct lb_env has a pointer to the load balancing
cpumask and we don't need to pass a cpumask around anymore.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FFE8665.3080705@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add function tracer based kprobe optimization support
handlers on x86. This allows kprobes to use function
tracer for probing on mcount call.
Link: http://lkml.kernel.org/r/20120605102838.27845.26317.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Updated to new port of ftrace save regs functions ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Introduce function trace based kprobes optimization.
With ftrace optimization, a kprobe placed on the mcount call
address uses ftrace's mcount call instead of a breakpoint.
Furthermore, this optimization works with preemptive kernels,
unlike the current jump-based optimization. Of course,
this feature works only if the probe is on the mcount call.
A probe is not optimized with ftrace (nor put on ftrace) only if
kprobe.break_handler is set. The reason for this
limitation is that break_handler may be used only
by jprobes, which change the ip address (for fetching the function
arguments), but the function tracer ignores a modified ip address.
Changes in v2:
- Fix ftrace_ops registering right after setting its filter.
- Unregister ftrace_ops if there is no kprobe using.
- Remove notrace dependency from __kprobes macro.
Link: http://lkml.kernel.org/r/20120605102832.27845.63461.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Break a big critical region into fine-grained pieces in the
kprobe registration path. This helps us to solve a circular
locking dependency when introducing ftrace-based kprobes.
Link: http://lkml.kernel.org/r/20120605102826.27845.81689.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently module_mutex is taken before kprobe_mutex, but this
can cause issues when we have kprobes register ftrace, as the ftrace
mutex is taken before enabling a tracepoint, which currently takes
the module mutex.
If module_mutex is taken before kprobe_mutex, then we can not
have kprobes use the ftrace infrastructure.
There seems to be no reason that the kprobe_mutex can't be taken
before the module_mutex. Running lockdep shows that it is safe
among the kernels I've run.
Link: http://lkml.kernel.org/r/20120605102814.27845.21047.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a new filter update interface ftrace_set_filter_ip()
to set ftrace filter by ip address, not only glob pattern.
Link: http://lkml.kernel.org/r/20120605102808.27845.67952.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add selftests to test the save-regs functionality of ftrace.
If the arch supports saving regs, then it will make sure that regs is
at least not NULL in the callback.
If the arch does not support saving regs, it makes sure that the
registering of the ftrace_ops that requests saving regs fails.
It then tests the registering of the ftrace_ops succeeds if the
'IF_SUPPORTED' flag is set. Then it makes sure that the regs passed to
the function is NULL.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add selftests to test the function tracing recursion protection actually
does work. It also tests if a ftrace_ops states it will perform its own
protection. Although, even if the ftrace_ops states it will protect itself,
the ftrace infrastructure may still provide protection if the arch does
not support all features or another ftrace_ops is registered.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As more users of the function tracer utility are being added, they do
not always add the necessary recursion protection. To protect from
function recursion due to tracing, if the callback ftrace_ops does not
specifically specify that it protects against recursion (by setting
the FTRACE_OPS_FL_RECURSION_SAFE flag), the list operation will be
called by the mcount trampoline which adds recursion protection.
If the flag is set, then the function will be called directly with no
extra protection.
Note, the list operation is called if more than one function callback
is registered, or if the arch does not support all of the function
tracer features.
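The dispatch rule can be sketched in plain userspace C. Everything here
is a hypothetical stand-in (struct ops_sketch, OPS_FL_RECURSION_SAFE, a
single global guard instead of per-CPU state); it only illustrates the
decision between a direct call and a guarded call:

```c
#include <assert.h>
#include <stddef.h>

#define OPS_FL_RECURSION_SAFE 0x1u   /* hypothetical stand-in flag */

struct ops_sketch {
	void (*func)(struct ops_sketch *ops);
	unsigned int flags;
	int calls;
};

static int in_callback;   /* per-CPU in a real tracer; one global here */

/* Guarded path: drop the call if we are already inside a callback. */
static void call_guarded(struct ops_sketch *ops)
{
	if (in_callback)
		return;
	in_callback = 1;
	ops->func(ops);
	in_callback = 0;
}

/* Dispatch: direct call only when the callback promises to protect itself. */
void trace_dispatch(struct ops_sketch *ops)
{
	assert(ops != NULL && ops->func != NULL);
	if (ops->flags & OPS_FL_RECURSION_SAFE)
		ops->func(ops);
	else
		call_guarded(ops);
}

/* A callback that re-enters the tracer until it has run three times. */
void reentrant_cb(struct ops_sketch *ops)
{
	ops->calls++;
	if (ops->calls < 3)
		trace_dispatch(ops);
}
```

Without the flag the nested invocations are swallowed by the guard, so
the callback runs once; with the flag set it is called directly and is
responsible for bounding its own recursion.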
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the kernel never sets KGDB_REASON_NMI. We do now, when we enter
KGDB/KDB from an NMI.
This is not to be confused with kgdb_nmicallback(): the NMI callback is
an entry for the slave CPUs during CPU roundup, but REASON_NMI is the
entry for the master CPU.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Having the CPU in the more prompt is completely redundant with the
standard kdb prompt, and it also wastes 32 bytes on the stack.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
This code cleanup was missed in the original kdb merge, and this code
is simply not used at all. The code that was previously used to set
the KDB_FLAG_ONLY_DO_DUMP was removed prior to the initial kdb merge.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
When the requested range is outside of the root range the logic in
__reserve_region_with_split will cause an infinite recursion which will
overflow the stack as seen in the warning below.
This particular stack overflow was caused by requesting the
(100000000-107ffffff) range while the root range was (0-ffffffff). In
this case __request_resource would return the whole root range as
conflict range (i.e. 0-ffffffff). Then, the logic in
__reserve_region_with_split would continue the recursion requesting the
new range as (conflict->end+1, end) which incidentally in this case
equals the originally requested range.
This patch aborts looking for a usable range when the request does not
intersect with the root range. When the request partially overlaps with
the root range, it adjusts the request to fall in the root range and then
continues with the new request.
When the request is modified or aborted, errors and a stack trace are
logged to allow catching the errors in the upper layers.
[ 5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
[ 5.975150] Modules linked in:
[ 5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
[ 5.985324] Call Trace:
[ 5.987759] [<c1039dfc>] ? console_unlock+0x17b/0x18d
[ 5.992891] [<c1039620>] warn_slowpath_common+0x48/0x5d
[ 5.998194] [<c1031758>] ? sub_preempt_count+0x63/0x89
[ 6.003412] [<c1039644>] warn_slowpath_null+0xf/0x13
[ 6.008453] [<c1031758>] sub_preempt_count+0x63/0x89
[ 6.013499] [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
[ 6.018453] [<c10c6349>] add_partial+0x36/0x3b
[ 6.022973] [<c10c7c0a>] deactivate_slab+0x96/0xb4
[ 6.027842] [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
[ 6.034456] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.039842] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.045232] [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
[ 6.050710] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.056100] [<c103f78f>] kzalloc.constprop.5+0x29/0x38
[ 6.061320] [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
[ 6.067230] [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
...
[ 7.179057] [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
[ 7.184970] [<c17b4779>] reserve_region_with_split+0x30/0x42
[ 7.190709] [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
[ 7.196623] [<c17c9526>] pcibios_resource_survey+0x23/0x2a
[ 7.202184] [<c17cad8a>] pcibios_init+0x23/0x35
[ 7.206789] [<c17ca574>] pci_subsys_init+0x3f/0x44
[ 7.211659] [<c1002088>] do_one_initcall+0x72/0x122
[ 7.216615] [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
[ 7.221659] [<c17a27ff>] kernel_init+0xa6/0x118
[ 7.226265] [<c17a2759>] ? start_kernel+0x334/0x334
[ 7.231223] [<c14d7482>] kernel_thread_helper+0x6/0x10
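The range handling described before the trace can be sketched as a small
standalone function. The struct and function names are hypothetical and
the bounds are treated as inclusive; the real code also logs errors and
a stack trace, which is omitted here:

```c
#include <assert.h>
#include <stdint.h>

struct range { uint64_t start, end; };   /* inclusive bounds */

/*
 * Returns 0 when req intersects root, clamping req to root if needed.
 * Returns -1 when there is no overlap and the search should be aborted.
 */
int clamp_to_root(const struct range *root, struct range *req)
{
	assert(root->start <= root->end && req->start <= req->end);

	if (req->start > root->end || req->end < root->start)
		return -1;              /* entirely outside: give up */

	if (req->start < root->start)
		req->start = root->start;
	if (req->end > root->end)
		req->end = root->end;
	return 0;
}
```

With a root of 0-ffffffff, the request 100000000-107ffffff from the
report is rejected up front instead of being retried forever.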
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
register_sysctl_table() is a strange function, as it makes internal
allocations (a header) to register a sysctl_table. This header is a
handle to the table that is created, and can be used to unregister the
table. But if the table is permanent and never unregistered, the header
acts the same as a static variable.
Unfortunately, this allocation of memory that is never expected to be
freed fools kmemleak into thinking that we have leaked memory. For those
sysctl tables that are never unregistered, and have no pointer referencing
them, kmemleak will think that these are memory leaks:
unreferenced object 0xffff880079fb9d40 (size 192):
comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
[<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
[<ffffffff8110b852>] __kmalloc+0x107/0x153
[<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
[<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
[<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
[<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
[<ffffffff81afb0a1>] sysctl_init+0x10/0x14
[<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
[<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
[<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
[<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
[<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
[<ffffffffffffffff>] 0xffffffffffffffff
The sysctl_base_table used by sysctl itself is one such instance that
registers the table to never be unregistered.
Use kmemleak_not_leak() to suppress the kmemleak false positive.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The last line of vmcoreinfo note does not end with \n. Parsing all the
lines in note becomes easier if all lines end with \n instead of trying to
special case the last line.
I know at least one tool, vmcore-dmesg in kexec-tools tree which made the
assumption that all lines end with \n. I think it is a good idea to fix
it.
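To illustrate why a uniform terminator helps, here is a small sketch of
a line walker that assumes every line ends with \n. The function name is
made up for this example and is not taken from any existing tool:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Counts complete lines in a note of len bytes. Returns -1 if the final
 * line is not newline-terminated, which is the case a parser would
 * otherwise have to special-case.
 */
int count_note_lines(const char *note, size_t len)
{
	int lines = 0;
	const char *p = note;
	const char *end = note + len;

	assert(note != NULL);
	while (p < end) {
		const char *nl = memchr(p, '\n', (size_t)(end - p));

		if (!nl)
			return -1;  /* trailing text without '\n' */
		lines++;
		p = nl + 1;
	}
	return lines;
}
```

When the last line carries its \n, the loop body is the whole parser;
without it, the -1 branch is the extra case every consumer must handle.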
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function dup_task() may fail at the following function calls in the
following order.
0) alloc_task_struct_node()
1) alloc_thread_info_node()
2) arch_dup_task_struct()
A failure at 0) needs no cleanup; the function can just return. But a
failure at 1) requires releasing the task_struct allocated by 0) before it
returns. Likewise, a failure at 2) requires releasing the task_struct and
thread_info allocated by 0) and 1).
The existing error handling calls free_task_struct() and
free_thread_info() which do not only release task_struct and thread_info,
but also call architecture specific arch_release_task_struct() and
arch_release_thread_info().
The problem is that task_struct and thread_info are not fully initialized
yet at this point, but arch_release_task_struct() and
arch_release_thread_info() are called with them.
For example, x86 defines its own arch_release_task_struct() that releases
a task_xstate. If alloc_thread_info_node() fails in dup_task(),
arch_release_task_struct() is called with task_struct which is just
allocated and filled with garbage in this error handling.
This actually happened with tools/testing/fault-injection/failcmd.sh
# env FAILCMD_TYPE=fail_page_alloc \
./tools/testing/fault-injection/failcmd.sh --times=100 \
--min-order=0 --ignore-gfp-wait=0 \
-- make -C tools/testing/selftests/ run_tests
In order to fix this issue, make free_{task_struct,thread_info}() not
call arch_release_{task_struct,thread_info}() and call
arch_release_{task_struct,thread_info}() explicitly where needed.
arch_release_task_struct() and arch_release_thread_info() are
defined as empty by default. So this change only affects the
architectures which implement their own arch_release_task_struct() or
arch_release_thread_info() as listed below.
arch_release_task_struct(): x86, sh
arch_release_thread_info(): mn10300, tile
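The intended split between "plain free" and "arch release" can be
sketched in userspace C. All names below are hypothetical stand-ins, and
the fail_at parameter exists only to simulate a failure at step 1) or
2); this is an illustration of the unwinding order, not the kernel
implementation:

```c
#include <assert.h>
#include <stdlib.h>

struct thread_info_sketch { int dummy; };
struct task_sketch { struct thread_info_sketch *ti; int arch_state_ready; };

static int arch_release_calls;   /* counts calls for the demonstration */

/* Arch hook: only valid on a task whose arch state was initialised. */
static void arch_release_task(struct task_sketch *t)
{
	assert(t->arch_state_ready);
	arch_release_calls++;
}

/* Plain free: deliberately does not call the arch hook. */
static void free_task_plain(struct task_sketch *t)
{
	free(t);
}

/* fail_at: 0 = no failure, 1 = thread_info alloc fails, 2 = arch dup fails. */
struct task_sketch *dup_task_sketch(int fail_at)
{
	struct task_sketch *t = malloc(sizeof(*t));   /* step 0 */

	if (!t)
		return NULL;
	t->arch_state_ready = 0;

	t->ti = (fail_at == 1) ? NULL : malloc(sizeof(*t->ti));  /* step 1 */
	if (!t->ti)
		goto free_task;

	if (fail_at == 2)                              /* step 2 */
		goto free_ti;
	t->arch_state_ready = 1;
	return t;

free_ti:
	free(t->ti);
free_task:
	free_task_plain(t);     /* no arch hook on a half-built task */
	return NULL;
}

/* Teardown of a completely built task calls the arch hook explicitly. */
void put_task_sketch(struct task_sketch *t)
{
	arch_release_task(t);
	free(t->ti);
	free_task_plain(t);
}
```

The error paths never reach the arch hook, so it is never handed a
structure whose arch-specific state is still garbage.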
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To make way for "fork: fix error handling in dup_task()", which fixes the
errors more completely.
Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code can be replaced by vma_pages(). So use it to simplify
the code.
[akpm@linux-foundation.org: initialise `len' at its definition site]
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The system deadlocks (at least since 2.6.10) when
call_usermodehelper(UMH_WAIT_EXEC) request triggers
call_usermodehelper(UMH_WAIT_PROC) request.
This is because "khelper thread is waiting for the worker thread at
wait_for_completion() in do_fork() since the worker thread was created
with CLONE_VFORK flag" and "the worker thread cannot call complete()
because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper
thread cannot start processing UMH_WAIT_PROC request because the khelper
thread is waiting for the worker thread at wait_for_completion() in
do_fork()".
The easiest example to observe this deadlock is to use a corrupted
/sbin/hotplug binary (like shown below).
# : > /tmp/dummy
# chmod 755 /tmp/dummy
# echo /tmp/dummy > /proc/sys/kernel/hotplug
# modprobe whatever
call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
module. do_execve("/tmp/dummy") triggers a call to
request_module("binfmt-0000") from search_binary_handler() which in turn
calls call_usermodehelper(UMH_WAIT_PROC).
In order to avoid the deadlock, as a for-now and easy-to-backport solution,
do not call wait_for_completion() in call_usermodehelper_exec() if the
worker thread was created by the khelper thread with the CLONE_VFORK flag.
A future, more fundamental solution might be to replace the singleton
khelper thread with a workqueue so that recursive calls up to the
max_active dependency loop can be handled without deadlock.
[akpm@linux-foundation.org: add comment to kmod_thread_locker]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vprintk_emit() prefix parsing should only be done for internal kernel
messages. This allows existing behavior to be kept in all cases.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current form of a KERN_<LEVEL> is "<.>".
Add printk_get_level and printk_skip_level functions to handle these
formats.
These functions centralize tests of KERN_<LEVEL> so a future modification
can change the KERN_<LEVEL> style and shorten the number of bytes consumed
by these headers.
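A userspace sketch of what such helpers might look like for the "<.>"
form is shown below. The sketch_ names are made up, and the accepted
level characters ('0' to '7' plus 'd') are an assumption of this example
rather than something stated above:

```c
#include <assert.h>
#include <stddef.h>

/* Returns the level character if buf starts with "<c>", else 0. */
int sketch_get_level(const char *buf)
{
	assert(buf != NULL);
	if (buf[0] == '<' && buf[1] != '\0' && buf[2] == '>') {
		char c = buf[1];

		/* Assumed set of valid level characters. */
		if ((c >= '0' && c <= '7') || c == 'd')
			return c;
	}
	return 0;
}

/* Returns a pointer past the "<c>" header, or buf itself if there is none. */
const char *sketch_skip_level(const char *buf)
{
	return sketch_get_level(buf) ? buf + 3 : buf;
}
```

Callers that go through helpers like these never look at the header
bytes themselves, so the header encoding can change in one place.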
[akpm@linux-foundation.org: fix build error and warning]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Wu Fengguang <wfg@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If argv_split() fails, the code ends up calling argv_free(NULL). Fix
it up and clean things up a bit.
Addresses Coverity report 703573.
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On the suspend/resume path the boot CPU does not go through an
offline->online transition. This breaks the NMI detector post-resume
since it depends on PMU state that is lost when the system gets
suspended.
Fix this by forcing a CPU offline->online transition for the lockup
detector on the boot CPU during resume.
To provide more context, we enable NMI watchdog on Chrome OS. We have
seen several reports of systems freezing up completely which indicated
that the NMI watchdog was not firing for some reason.
Debugging further, we found a simple way of repro'ing system freezes --
issuing the command 'taskset 1 sh -c "echo nmilockup > /proc/breakme"'
after the system has been suspended/resumed one or more times.
With this patch in place, the system freezes result in panics, as
expected.
These panics provide a nice stack trace for us to debug the actual issue
causing the freeze.
[akpm@linux-foundation.org: fiddle with code comment]
[akpm@linux-foundation.org: make lockup_detector_bootcpu_resume() conditional on CONFIG_SUSPEND]
[akpm@linux-foundation.org: fix section errors]
Signed-off-by: Sameer Nanda <snanda@chromium.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
panic_lock is meant to ensure that panic processing takes place only on
one cpu; if any of the other cpus encounter a panic, they will spin
waiting to be shut down.
However, this causes a regression in this scenario:
1. Cpu 0 encounters a panic and acquires the panic_lock
and proceeds with the panic processing.
2. There is an interrupt on cpu 0 that also encounters
an error condition and invokes panic.
3. This second invocation fails to acquire the panic_lock
and enters the infinite while loop in panic_smp_self_stop.
Thus all panic processing is stopped, and the cpu is stuck for eternity
in the while(1) inside panic_smp_self_stop.
To address this, disable local interrupts with local_irq_disable before
acquiring the panic_lock. This will prevent interrupt handlers from
executing during the panic processing, thus avoiding this particular
problem.
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When suid_dumpable=2, detect unsafe core_pattern settings and warn when
they are seen.
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setting "error" to the error number on failure is enough; there is no
need to set the "error" variable to zero in each switch case, since it
was already initialized to zero. Also replace "return 0" in the switch
cases with a break statement.
Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Like do_wp_page(), __replace_page() should do munlock_vma_page()
for the case when the old page still has other !VM_LOCKED
mappings. Unfortunately this needs mm/internal.h.
Also, move put_page() outside of ptl lock. This doesn't really
matter but looks a bit better.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182249.GA20372@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. vma_address() returns loff_t, this looks confusing and this
is unnecessary after the previous change. Make it return "ulong",
all callers truncate the result anyway.
2. Its name conflicts with mm/rmap.c:vma_address(), rename it to
offset_to_vaddr(), this matches vaddr_to_offset().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182247.GA20365@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. register_for_each_vma() checks that vma_address() == vaddr,
but this is not enough. We should also ensure that
vaddr >= vm_start, find_vma() guarantees "vaddr < vm_end" only.
2. After the previous changes, register_for_each_vma() is the
only reason why vma_address() has to return loff_t, all other
users know that we have the valid mapping at this offset and
thus the overflow is not possible.
Change the code to use vaddr_to_offset() instead, imho this looks
more clean/understandable and now we can change vma_address().
3. While at it, remove the unnecessary type-cast.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182244.GA20362@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the new helper, vaddr_to_offset(vma, vaddr) which returns
the offset in vma->vm_file this vaddr is mapped at.
Change build_probe_list() and find_active_uprobe() to use the
new helper, the next patch adds another user.
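For illustration, the mapping can be sketched in userspace as follows.
The struct is a hypothetical three-field stand-in for the vma, 4 KiB
pages are assumed, and the inverse helper is included only to show that
the two directions round-trip:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12   /* assumes 4 KiB pages */

struct vma_sketch {
	unsigned long vm_start;   /* first mapped address */
	unsigned long vm_end;     /* one past the last mapped address */
	unsigned long vm_pgoff;   /* file offset of vm_start, in pages */
};

/* File offset that vaddr maps to; vaddr must lie inside the vma. */
int64_t vaddr_to_offset_sketch(const struct vma_sketch *vma,
			       unsigned long vaddr)
{
	assert(vaddr >= vma->vm_start && vaddr < vma->vm_end);
	return ((int64_t)vma->vm_pgoff << PAGE_SHIFT_SKETCH) +
	       (int64_t)(vaddr - vma->vm_start);
}

/* Inverse direction: virtual address at which a file offset is mapped. */
unsigned long offset_to_vaddr_sketch(const struct vma_sketch *vma,
				     int64_t offset)
{
	int64_t base = (int64_t)vma->vm_pgoff << PAGE_SHIFT_SKETCH;

	return vma->vm_start + (unsigned long)(offset - base);
}
```

For a vma starting at 0x400000 with vm_pgoff 3, address 0x400010 maps to
file offset 0x3010, and converting that offset back gives 0x400010.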
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182242.GA20355@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently build_probe_list() builds the list of all uprobes
attached to the given inode, and the caller should filter out
those that don't fall into the [start,end) range. This is
sub-optimal.
This patch turns find_least_offset_node() into
find_node_in_range() which returns the first node inside the
[min,max] range, and changes build_probe_list() to use this node
as a starting point for rb_prev() and rb_next() to find all
other nodes the caller needs. The resulting list is no longer
sorted but we do not care.
This can speed up both build_probe_list() and the callers, but
there is another reason to introduce find_node_in_range(). It
can be used to figure out whether the given vma has uprobes or
not, this will be needed soon.
While at it, shift INIT_LIST_HEAD(tmp_list) into
build_probe_list().
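The search strategy can be illustrated with a sorted array standing in
for the rb-tree (an assumption made purely to keep the sketch short and
runnable). A binary search finds any one element inside [min,max], and
the list is then built by walking left and right from that element:

```c
#include <assert.h>
#include <stddef.h>

/* Index of any element with min <= v[i] <= max in sorted v, or -1. */
int find_in_range(const long *v, int n, long min, long max)
{
	int lo = 0, hi = n - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (v[mid] < min)
			lo = mid + 1;
		else if (v[mid] > max)
			hi = mid - 1;
		else
			return mid;
	}
	return -1;
}

/* Collects every element in [min, max] into out; returns the count. */
int build_list(const long *v, int n, long min, long max, long *out)
{
	int hit, count = 0, i;

	assert(min <= max);
	hit = find_in_range(v, n, min, max);
	if (hit < 0)
		return 0;
	/* Walk left from the hit (the rb_prev() direction)... */
	for (i = hit; i >= 0 && v[i] >= min; i--)
		out[count++] = v[i];
	/* ...then right from just after it (the rb_next() direction). */
	for (i = hit + 1; i < n && v[i] <= max; i++)
		out[count++] = v[i];
	return count;
}
```

As in the patch, the output is complete but not sorted, and
find_in_range() on its own answers whether the range has any entry.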
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182240.GA20352@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
vma->vm_pgoff is "unsigned long", it should be promoted to
loff_t before the multiplication to avoid the overflow.
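The effect is easy to reproduce in userspace. The sketch below uses
uint32_t to model a 32-bit "unsigned long", int64_t to model loff_t, and
a shift by an assumed 4 KiB page size in place of the multiplication;
the function names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12   /* assumes 4 KiB pages */

/* Shift done in 32 bits: the high bits are lost before widening. */
int64_t offset_narrow(uint32_t pgoff)
{
	return (int64_t)(uint32_t)(pgoff << PAGE_SHIFT_SKETCH);
}

/* Widen first, then shift: the full 64-bit offset survives. */
int64_t offset_wide(uint32_t pgoff)
{
	return (int64_t)pgoff << PAGE_SHIFT_SKETCH;
}

/* Page 0x100000 starts at byte offset 4 GiB, which needs 33 bits. */
void offset_demo(void)
{
	assert(offset_narrow(0x100000u) == 0);
	assert(offset_wide(0x100000u) == 0x100000000LL);
}
```

The only difference between the two functions is where the cast sits
relative to the shift.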
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182233.GA20339@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
uprobe_munmap() does get_user_pages() and it is also called from
the final mmput()->exit_mmap() path. This slows down
exit/mmput() for no reason, and I think it is simply
dangerous/wrong to try to fault-in a page into the dying mm. If
nothing else, this happens after the last sync_mm_rss(), afaics
handle_mm_fault() can change the task->rss_stat and make the
subsequent check_mm() unhappy.
Change uprobe_munmap() to check mm->mm_users != 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182231.GA20336@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The bug was introduced by me in 449d0d7c ("uprobes: Simplify the
usage of uprobe->pending_list").
Yes, we do not care about uprobe->pending_list after return and
nobody can remove the current list entry, but put_uprobe(uprobe)
can actually free it and thus we need list_for_each_safe().
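The hazard is generic to any list walk that may free the current entry.
A minimal userspace sketch with a hand-rolled singly linked list (not
the kernel list API; all names are hypothetical) shows the safe shape:

```c
#include <assert.h>
#include <stdlib.h>

struct node { struct node *next; int refs; };

static int freed;   /* counts frees for the demonstration */

/* Drops one reference; frees the node when the count reaches zero. */
static void put_node(struct node *n)
{
	if (--n->refs == 0) {
		free(n);
		freed++;
	}
}

/* Pushes a new node holding one reference onto the list head. */
struct node *push_node(struct node *head)
{
	struct node *n = malloc(sizeof(*n));

	assert(n != NULL);
	n->next = head;
	n->refs = 1;
	return n;
}

/* Safe walk: fetch ->next before put_node() can free the current entry. */
void put_all(struct node *head)
{
	struct node *pos, *next;

	for (pos = head; pos != NULL; pos = next) {
		next = pos->next;   /* read before the node may be freed */
		put_node(pos);
	}
}
```

An unsafe loop would read pos->next after put_node(pos), which is
exactly the use-after-free that the _safe iterator avoids.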
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Link: http://lkml.kernel.org/r/20120729182229.GA20329@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment above write_opcode()->lock_page(old_page) tells
about the race with do_wp_page(). I don't really understand
exactly which race it means, but afaics this lock_page() was not
enough to close all races with do_wp_page().
Anyway, since:
77fc4af1b5 uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing
this code is always called with ->mmap_sem held for writing,
so we can forget about do_wp_page().
However, we can't simply remove this lock_page(), and the only
(afaics) reason is __replace_page()->try_to_free_swap().
Nothing in write_opcode() needs it, move it into
__replace_page() and fix the comment.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182220.GA20322@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
write_opcode() does lock_page(new_page) for no reason. Nobody
can see this page until __replace_page() exposes it under ptl
lock, and we do nothing with this page after pte_unmap_unlock().
If nothing else, the similar code in do_wp_page() doesn't lock
the new page for page_add_new_anon_rmap/set_pte_at_notify.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182218.GA20315@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
page_address_in_vma(old_page) in __replace_page() is ugly and
wrong. The caller already knows the correct virtual address,
this page was found by get_user_pages(vaddr).
However, page_address_in_vma() can actually fail if
page->mapping was cleared by __delete_from_page_cache() after
get_user_pages() returns. But this means the race with page
reclaim, write_opcode() should not fail, it should retry and
read this page again. Probably the race with remove_mapping() is
not possible due to page_freeze_refs() logic, but afaics at
least shmem_writepage()->shmem_delete_from_page_cache() can
clear ->mapping.
We could change __replace_page() to return -EAGAIN in this case,
but it would be better to simply use the caller's vaddr and rely
on page_check_address().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182216.GA20311@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
write_opcode() rechecks valid_vma() and ->f_mapping, this is
pointless. The caller, register_for_each_vma() or uprobe_mmap(),
has already done these checks under mmap_sem.
To clarify, uprobe_mmap() checks valid_vma() only, but we can
rely on build_probe_list(vm_file->f_mapping->host).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182212.GA20304@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Adds audit messages for unexpected link restriction violations so that
system owners will have some sort of potentially actionable information
about misbehaving processes.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (e.g. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
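The rule above can be read as a single predicate. The following is a minimal userspace sketch of it, not the kernel implementation; struct link_ctx and may_follow_symlink() are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, simplified view of what the check needs to know. */
struct link_ctx {
	uid_t follower_uid;       /* uid of the process following the symlink */
	uid_t link_uid;           /* owner of the symlink itself */
	uid_t dir_uid;            /* owner of the directory holding the symlink */
	bool  dir_sticky;         /* directory has the sticky bit set */
	bool  dir_world_writable; /* directory is writable by everyone */
};

/* Returns true when following the symlink is permitted by the rule above. */
static bool may_follow_symlink(const struct link_ctx *c)
{
	/* Outside a sticky world-writable directory: always allowed. */
	if (!(c->dir_sticky && c->dir_world_writable))
		return true;
	/* The follower owns the symlink. */
	if (c->follower_uid == c->link_uid)
		return true;
	/* The directory owner owns the symlink. */
	if (c->dir_uid == c->link_uid)
		return true;
	return false;
}
```

With this sketch, a root process following a symlink planted by uid 1000 in a root-owned sticky world-writable directory is refused, while the same user following their own symlink is not.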
Some pointers to the history of earlier discussion that I could find:
 1996 Aug, Zygo Blaxell
  http://marc.info/?l=bugtraq&m=87602167419830&w=2
 1996 Oct, Andrew Tridgell
  http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
 1997 Dec, Albert D Cahalan
  http://lkml.org/lkml/1997/12/16/4
 2005 Feb, Lorenzo Hernández García-Hierro
  http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
 2010 May, Kees Cook
  https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
 - Violates POSIX.
   - POSIX didn't consider this situation and it's not useful to follow
     a broken specification at the cost of security.
 - Might break unknown applications that use this feature.
   - Applications that break because of the change are easy to spot and
     fix. Applications that are vulnerable to symlink ToCToU by not having
     the change aren't. Additionally, no applications have yet been found
     that rely on this behavior.
 - Applications should just use mkstemp() or O_CREAT|O_EXCL.
   - True, but applications are not perfect, and new software is written
     all the time that makes these mistakes; blocking this flaw at the
     kernel is a single solution to the entire class of vulnerability.
 - This should live in the core VFS.
   - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
 - This should live in an LSM.
   - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (e.g. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
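As a rough illustration, the hardlink rule stated above reduces to the check below. This is a userspace sketch under the assumption that only the two conditions named here matter; struct hardlink_ctx and may_create_hardlink() are hypothetical names, and the actual patch may apply further conditions that are not described in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, simplified inputs to the hardlink check. */
struct hardlink_ctx {
	uid_t caller_uid; /* uid of the process calling link() */
	uid_t file_uid;   /* owner of the existing file */
	bool  can_read;   /* caller has read access to the file */
	bool  can_write;  /* caller has write access to the file */
};

/* Returns true when creating the hardlink is permitted by the rule above. */
static bool may_create_hardlink(const struct hardlink_ctx *c)
{
	/* The caller already owns the existing file. */
	if (c->caller_uid == c->file_uid)
		return true;
	/* Otherwise the caller needs both read and write access. */
	return c->can_read && c->can_write;
}
```

A user who can only read a root-owned setuid binary can therefore no longer "pin" it with a hardlink.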
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Recently, glibc made a change to suppress sign-conversion warnings in
FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the
kernel's definition of __NFDBITS if applications #include
<linux/types.h> after including <sys/select.h>. A build failure would
be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
flags to gcc.
It was suggested that the kernel should either match the glibc
definition of __NFDBITS or remove that entirely. The current in-kernel
uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
uses of the related __FDELT and __FDMASK defines. Given that, we'll
continue the cleanup that was started with commit 8b3d1cda4f
("posix_types: Remove fd_set macros") and drop the remaining unused
macros.
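For illustration, the kind of fd bitmap arithmetic that can be written with BITS_PER_LONG alone looks roughly like the sketch below. SKETCH_BITS_PER_LONG and the two helpers are hypothetical userspace stand-ins, not the kernel's code; everything is kept unsigned so no signed/unsigned comparison arises.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's BITS_PER_LONG. */
#define SKETCH_BITS_PER_LONG ((unsigned int)(CHAR_BIT * sizeof(unsigned long)))

/* Set the bit for fd in an array of unsigned long words. */
static void fd_bit_set(unsigned long *words, unsigned int fd)
{
	words[fd / SKETCH_BITS_PER_LONG] |= 1UL << (fd % SKETCH_BITS_PER_LONG);
}

/* Test whether the bit for fd is set. */
static bool fd_bit_test(const unsigned long *words, unsigned int fd)
{
	return (words[fd / SKETCH_BITS_PER_LONG] >> (fd % SKETCH_BITS_PER_LONG)) & 1UL;
}
```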
Additionally, linux/time.h has similar macros defined that expand to
nothing so we'll remove those at the same time.
Reported-by: Jeff Law <law@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
CC: <stable@vger.kernel.org>
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
[ .. and fix up whitespace as per akpm ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler changes from Ingo Molnar:
"The biggest change is a performance improvement on SMP systems:
| 4 socket 40 core + SMT Westmere box, single 30 sec tbench
| runs, higher is better:
|
| clients       1       2       4       8      16      32      64     128
|..........................................................................
| pre          30      41     118     645    3769    6214   12233   14312
| post        299     603    1211    2418    4697    6847   11606   14557
|
| A nice increase in performance.
This speedup is particularly noticeable on heavily interacting
few-tasks workloads, so the changes should help desktop-style Xorg
workloads and interactivity as well, on multi-core CPUs.
There are also cpuset suspend behavior fixes/restructuring and various
smaller tweaks."
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix race in task_group()
sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
sched: Reset loop counters if all tasks are pinned and we need to redo load balance
sched: Reorder 'struct lb_env' members to reduce its size
sched: Improve scalability via 'CPU buddies', which withstand random perturbations
cpusets: Remove/update outdated comments
cpusets, hotplug: Restructure functions that are invoked during hotplug
cpusets, hotplug: Implement cpuset tree traversal in a helper function
CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
sched/x86: Remove broken power estimation
Merge tag 'driver-core-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core changes from Greg Kroah-Hartman:
"Here's the big driver core pull request for 3.6-rc1.
Unlike 3.5, this kernel should be a lot tamer, with the printk changes
now settled down. All we have here is some extcon driver updates, w1
driver updates, a few printk cleanups that weren't needed for 3.5, but
are good to have now, and some other minor fixes/changes in the driver
core.
All of these have been in the linux-next releases for a while now.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
printk: Export struct log size and member offsets through vmcoreinfo
Drivers: hv: Change the hex constant to a decimal constant
driver core: don't trigger uevent after failure
extcon: MAX77693: Add extcon-max77693 driver to support Maxim MAX77693 MUIC device
sysfs: fail dentry revalidation after namespace change fix
sysfs: fail dentry revalidation after namespace change
extcon: spelling of detach in function doc
extcon: arizona: Stop microphone detection if we give up on it
extcon: arizona: Update cable reporting calls and split headset
PM / Runtime: Do not increment device usage counts before probing
kmsg - do not flush partial lines when the console is busy
kmsg - export "continuation record" flag to /dev/kmsg
kmsg - avoid warning for CONFIG_PRINTK=n compilations
kmsg - properly print over-long continuation lines
driver-core: Use kobj_to_dev instead of re-implementing it
driver-core: Move kobj_to_dev from genhd.h to device.h
driver core: Move deferred devices to the end of dpm_list before probing
driver core: move uevent call to driver_register
driver core: fix shutdown races with probe/remove(v3)
Extcon: Arizona: Add driver for Wolfson Arizona class devices
...
Here's the big staging tree merge for the 3.6-rc1 merge window.
There are some patches in here outside of drivers/staging/, notably the iio
code (which is still straddling the staging / not staging boundary), the pstore
code, and the tracing code. All of these have gotten acks from the various
subsystem maintainers to be included in this tree. The pstore and tracing
patches are related, and are coming here as they replace one of the android
staging drivers.
Otherwise, the normal staging mess. Lots of cleanups and a few new drivers
(some iio drivers, and the large csr wireless driver abomination.)
Note, you will get a merge issue with the following files:
drivers/staging/comedi/drivers/s626.h
drivers/staging/gdm72xx/netlink_k.c
both of which should be trivial for you to handle.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging tree patches from Greg Kroah-Hartman:
"Here's the big staging tree merge for the 3.6-rc1 merge window.
There are some patches in here outside of drivers/staging/, notably
the iio code (which is still straddling the staging / not staging
boundary), the pstore code, and the tracing code. All of these have
gotten acks from the various subsystem maintainers to be included in
this tree. The pstore and tracing patches are related, and are coming
here as they replace one of the android staging drivers.
Otherwise, the normal staging mess. Lots of cleanups and a few new
drivers (some iio drivers, and the large csr wireless driver
abomination.)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fixed up trivial conflicts in drivers/staging/comedi/drivers/s626.h and
drivers/staging/gdm72xx/netlink_k.c
* tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1108 commits)
staging: csr: delete a bunch of unused library functions
staging: csr: remove csr_utf16.c
staging: csr: remove csr_pmem.h
staging: csr: remove CsrPmemAlloc
staging: csr: remove CsrPmemFree()
staging: csr: remove CsrMemAllocDma()
staging: csr: remove CsrMemCalloc()
staging: csr: remove CsrMemAlloc()
staging: csr: remove CsrMemFree() and CsrMemFreeDma()
staging: csr: remove csr_util.h
staging: csr: remove CsrOffSetOf()
stating: csr: remove unneeded #includes in csr_util.c
staging: csr: make CsrUInt16ToHex static
staging: csr: remove CsrMemCpy()
staging: csr: remove CsrStrLen()
staging: csr: remove CsrVsnprintf()
staging: csr: remove CsrStrDup
staging: csr: remove CsrStrChr()
staging: csr: remove CsrStrNCmp
staging: csr: remove CsrStrCmp
...
Otherwise they can't be filtered for a defined task:
  perf record -e sched:sched_switch ./foo
This command doesn't report any events without this patch.
I think it isn't a security concern if someone knows who will
be executed next - this can already be observed by polling /proc
state. By default perf is disabled for non-root users in any case.
I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/r/1342088069-1005148-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It seems there's no specific reason to open-code it. I guess
commit 0122ec5b02 ("sched: Add p->pi_lock to task_rq_lock()")
simply missed it. Let's be consistent with others.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341647342-6742-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some interrupt chips like MSI are oneshot safe by implementation. For
those interrupts we can avoid the mask/unmask sequence for threaded
interrupt handlers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1207132056540.32033@ionos
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Jan Kiszka <jan.kiszka@web.de>
When the map operation fails log the error code we get and add a WARN_ON()
so we get a backtrace (which should help work out which interrupt is the
source of the issue).
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
With the current state of irq_domain, the reverse map is always updated
when new IRQs get mapped. This means that the irq_find_mapping() function
can be simplified to execute the revmap lookup functions unconditionally.
This patch adds lookup functions for the revmaps that don't yet have one
and removes the slow path lookup code path.
v8: Broke out unrelated changes into separate patches. Rebased on Paul's irq
association patches.
v7: Rebased to irqdomain/next for v3.4 and applied before the removal of 'hint'
v6: Remove the slow path entirely. The only place where the slow path
could get called is for a linear mapping if the hwirq number is larger
than the linear revmap size. There shouldn't be any interrupt
controllers that do that.
v5: rewrite to not use a ->revmap() callback. It is simpler, smaller,
safer and faster to open code each of the revmap lookups directly into
irq_find_mapping() via a switch statement.
v4: Fix build failure on incorrect variable reference.
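The v5 note describes open-coding each revmap lookup in a switch statement. A rough userspace sketch of that shape follows. The enum, struct and function names are hypothetical simplifications rather than the kernel's definitions, only three illustrative map types are shown, and 0 is used to mean "no mapping".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reverse-map kinds, for illustration only. */
enum revmap_type { REVMAP_LEGACY, REVMAP_LINEAR, REVMAP_NOMAP };

struct irq_domain_sketch {
	enum revmap_type type;
	/* legacy: hwirqs map to irqs at a fixed offset */
	unsigned int first_irq;
	unsigned int first_hwirq;
	unsigned int legacy_size;
	/* linear: table indexed by hwirq, 0 means unmapped */
	const unsigned int *linear;
	size_t linear_size;
};

/* Look up the irq for hwirq; every map type has a direct lookup,
 * so there is no slow-path fallback. Returns 0 if not mapped. */
static unsigned int find_mapping(const struct irq_domain_sketch *d,
				 unsigned int hwirq)
{
	switch (d->type) {
	case REVMAP_LEGACY:
		if (hwirq < d->first_hwirq ||
		    hwirq >= d->first_hwirq + d->legacy_size)
			return 0;
		return hwirq - d->first_hwirq + d->first_irq;
	case REVMAP_LINEAR:
		if (hwirq >= d->linear_size)
			return 0;
		return d->linear[hwirq];
	case REVMAP_NOMAP:
		/* In this sketch the hwirq number is the irq number. */
		return hwirq;
	}
	return 0;
}
```

In this sketch a linear lookup beyond the table size simply reports "not mapped", which matches the v6 observation that no controller should need a slow path for that case.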
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Rob Herring <rob.herring@calxeda.com>
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull first round of SCSI updates from James Bottomley:
"The most important feature of this patch set is the new async
infrastructure that makes sure async_synchronize_full() synchronizes
all domains and allows us to remove all the hacks (like having
scsi_complete_async_scans() in the device base code) and means that
the async infrastructure will "just work" in future.
The rest is assorted driver updates (aacraid, bnx2fc, virtio-scsi,
megaraid, bfa, lpfc, qla2xxx, qla4xxx) plus a lot of infrastructure
work in sas and FC.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (97 commits)
[SCSI] Revert "[SCSI] fix async probe regression"
[SCSI] cleanup usages of scsi_complete_async_scans
[SCSI] queue async scan work to an async_schedule domain
[SCSI] async: make async_synchronize_full() flush all work regardless of domain
[SCSI] async: introduce 'async_domain' type
[SCSI] bfa: Fix to set correct return error codes and misc cleanup.
[SCSI] aacraid: Series 7 Async. (performance) mode support
[SCSI] aha152x: Allow use on 64bit systems
[SCSI] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning
[SCSI] bfa: squelch lockdep complaint with a spin_lock_init
[SCSI] qla2xxx: remove unnecessary reads of PCI_CAP_ID_EXP
[SCSI] qla4xxx: remove unnecessary read of PCI_CAP_ID_EXP
[SCSI] ufs: fix incorrect return value about SUCCESS and FAILED
[SCSI] ufs: reverse the ufshcd_is_device_present logic
[SCSI] ufs: use module_pci_driver
[SCSI] usb-storage: update usb devices for write cache quirk in quirk list.
[SCSI] usb-storage: add support for write cache quirk
[SCSI] set to WCE if usb cache quirk is present.
[SCSI] virtio-scsi: hotplug support for virtio-scsi
[SCSI] virtio-scsi: split scatterlist per target
...
Pull cgroup changes from Tejun Heo:
"Nothing too interesting. A minor bug fix and some cleanups."
* 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Update remount documentation
cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode
cgroup: Remove populate() documentation
cgroup: remove hierarchy_mutex
Pull workqueue changes from Tejun Heo:
"There are three major changes.
- WQ_HIGHPRI has been reimplemented so that high priority work items
are served by worker threads with -20 nice value from dedicated
highpri worker pools.
- CPU hotplug support has been reimplemented such that idle workers
are kept across CPU hotplug events. This makes CPU hotplug cheaper
(for PM) and makes the code simpler.
- flush_kthread_work() has been reimplemented so that a work item can
be freed while executing. This removes an annoying behavior
difference between kthread_worker and workqueue."
* 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix spurious CPU locality WARN from process_one_work()
kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed
kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation
workqueue: simplify CPU hotplug code
workqueue: remove CPU offline trustee
workqueue: don't butcher idle workers on an offline CPU
workqueue: reimplement CPU online rebinding to handle idle workers
workqueue: drop @bind from create_worker()
workqueue: use mutex for global_cwq manager exclusion
workqueue: ROGUE workers are UNBOUND workers
workqueue: drop CPU_DYING notifier operation
workqueue: perform cpu down operations from low priority cpu_notifier()
workqueue: reimplement WQ_HIGHPRI using a separate worker_pool
workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()
workqueue: separate out worker_pool flags
workqueue: use @pool instead of @gcwq or @cpu where applicable
workqueue: factor out worker_pool from global_cwq
workqueue: don't use WQ_HIGHPRI for unbound workqueues
Merge tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
"Host bridge hotplug:
- Add MMCONFIG support for hot-added host bridges (Jiang Liu)
Device hotplug:
- Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
- Call FINAL fixups for hot-added devices, too (Myron Stowe)
- Factor out generic code for P2P bridge hot-add (Yinghai Lu)
- Remove all functions in a slot, not just those with _EJx (Amos
Kong)
Dynamic resource management:
- Track bus number allocation (struct resource tree per domain)
(Yinghai Lu)
- Make P2P bridge 1K I/O windows work with resource reassignment
(Bjorn Helgaas, Yinghai Lu)
- Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
Power management:
- Add PCIe runtime D3cold support (Huang Ying)
Virtualization:
- Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex
Williamson)
- Add quirks for devices with broken INTx masking (Jan Kiszka)
Miscellaneous:
- Fix some PCI Express capability version issues (Myron Stowe)
- Factor out some arch code with a weak, generic, pcibios_setup()
(Myron Stowe)"
* tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (122 commits)
PCI: hotplug: ensure a consistent return value in error case
PCI: fix undefined reference to 'pci_fixup_final_inited'
PCI: build resource code for M68K architecture
PCI: pciehp: remove unused pciehp_get_max_lnk_width(), pciehp_get_cur_lnk_width()
PCI: reorder __pci_assign_resource() (no change)
PCI: fix truncation of resource size to 32 bits
PCI: acpiphp: merge acpiphp_debug and debug
PCI: acpiphp: remove unused res_lock
sparc/PCI: replace pci_cfg_fake_ranges() with pci_read_bridge_bases()
PCI: call final fixups hot-added devices
PCI: move final fixups from __init to __devinit
x86/PCI: move final fixups from __init to __devinit
MIPS/PCI: move final fixups from __init to __devinit
PCI: support sizing P2P bridge I/O windows with 1K granularity
PCI: reimplement P2P bridge 1K I/O windows (Intel P64H2)
PCI: disable MEM decoding while updating 64-bit MEM BARs
PCI: leave MEM and IO decoding disabled during 64-bit BAR sizing, too
PCI: never discard enable/suspend/resume_early/resume fixups
PCI: release temporary reference in __nv_msi_ht_cap_quirk()
PCI: restructure 'pci_do_fixups()'
...
Merge tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux
Pull devicetree updates from Rob Herring:
"A small set of changes for devicetree:
- Couple of Documentation fixes
- Addition of new helper function of_node_full_name
- Improve of_parse_phandle_with_args return values
- Some NULL related sparse fixes"
Grant's busy packing.
* tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux:
of: mtd: nuke useless const qualifier
devicetree: add helper inline for retrieving a node's full name
of: return -ENOENT when no property
usage-model.txt: fix typo machine_init->init_machine
of: Fix null pointer related warnings in base.c file
LED: Fix missing semicolon in OF documentation
of: fix a few typos in the binding documentation
Pull networking changes from David S Miller:
1) Remove the ipv4 routing cache. Now lookups go directly into the FIB
trie and use prebuilt routes cached there.
No more garbage collection, no more rDOS attacks on the routing
cache. Instead we now get predictable and consistent performance,
no matter what the pattern of traffic we service.
This has been almost 2 years in the making. Special thanks to
Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
have helped along the way.
I'm sure that with a change of this magnitude there will be some
kind of fallout, but such things ought to be simple to fix at this
point. Luckily I'm not European so I'll be around all of August to
fix things :-)
The major stages of this work here are each fronted by a forced
merge commit whose commit message contains a top-level description
of the motivations and implementation issues.
2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
input.
3) TCP SYN/ACK performance tweaks from Eric Dumazet.
4) Add namespace support for netfilter L4 conntrack helpers, from Gao
Feng.
5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
Yuval Mintz.
6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.
7) Support for connection tracker helpers in userspace, from Pablo
Neira Ayuso.
8) Allow userspace driven TX load balancing functions in TEAM driver,
from Jiri Pirko.
9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
embedded gotos.
10) TCP Small Queues, essentially minimize the amount of TCP data queued
up in the packet scheduler layer. Whereas the existing BQL (Byte
Queue Limits) limits the pkt_sched --> netdevice queuing levels,
this controls the TCP --> pkt_sched queueing levels.
From Eric Dumazet.
11) Reduce the number of get_page/put_page ops done on SKB fragments,
from Alexander Duyck.
12) Implement protection against blind resets in TCP (RFC 5961), from
Eric Dumazet.
13) Support the client side of TCP Fast Open, basically the ability to
send data in the SYN exchange, from Yuchung Cheng.
Basically, the sender queues up data with a sendmsg() call using
MSG_FASTOPEN, then they do the connect() which emits the queued up
fastopen data.
14) Avoid all the problems we get into in TCP when timers or PMTU events
hit a locked socket. The TCP Small Queues changes added a
tcp_release_cb() that allows us to queue work up to the
release_sock() caller, and that's what we use here too. From Eric
Dumazet.
15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
r8169: revert "add byte queue limit support".
ipv4: Change rt->rt_iif encoding.
net: Make skb->skb_iif always track skb->dev
ipv4: Prepare for change of rt->rt_iif encoding.
ipv4: Remove all RTCF_DIRECTSRC handliing.
ipv4: Really ignore ICMP address requests/replies.
decnet: Don't set RTCF_DIRECTSRC.
net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
ipv4: Remove redundant assignment
rds: set correct msg_namelen
openvswitch: potential NULL deref in sample()
tcp: dont drop MTU reduction indications
bnx2x: Add new 57840 device IDs
tcp: avoid oops in tcp_metrics and reset tcpm_stamp
niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
niu: Fix to check for dma mapping errors.
net: Fix references to out-of-scope variables in put_cmsg_compat()
net: ethernet: davinci_emac: add pm_runtime support
net: ethernet: davinci_emac: Remove unnecessary #include
...
commit 1e75fa8b (time: Condense timekeeper.xtime into xtime_sec)
introduced helper functions which apply a timespec to the core
internal timekeeper data. The internal storage type is u64. The
timespec tv_nsec value must be shifted before set or added to the
internal value. tv_nsec is a long, which is 32bit on a 32bit system,
so without casting tv_nsec to u64 we lose the bits which are shifted
over the 32bit boundary.
Add the proper typecasts.
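The effect can be reproduced in userspace. The sketch below models the 32-bit tv_nsec with a uint32_t (an unsigned type is used so the overflowing shift is well defined in C); the function names and the shift value of 8 used in the example are assumptions for illustration, not values taken from the timekeeping code.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for tv_nsec on a 32-bit system, where long is 32 bits wide. */
typedef uint32_t nsec32_t;

/* Without the cast: the shift is done in 32 bits and the high bits
 * are lost before the value is widened to 64 bits. */
static uint64_t shifted_nsec_truncated(nsec32_t nsec, unsigned int shift)
{
	return (uint32_t)(nsec << shift);
}

/* With the cast: the value is widened first, so no bits are lost. */
static uint64_t shifted_nsec_widened(nsec32_t nsec, unsigned int shift)
{
	return (uint64_t)nsec << shift;
}
```

For nsec = 999999999 and shift = 8, the widened form gives 255999999744 while the truncated form gives 2596929280, i.e. the result modulo 2^32.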
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343074957-16541-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If uaddr == uaddr2, then we have broken the rule of only requeueing
from a non-pi futex to a pi futex with this call. If we attempt this,
as the trinity test suite manages to do, we miss early wakeups as
q.key is equal to key2 (because they are the same uaddr). We will then
attempt to dereference the pi_mutex (which would exist had the futex_q
been properly requeued to a pi futex) and trigger a NULL pointer
dereference.
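One way to enforce the rule is to reject identical addresses before doing any work. The following is only a sketch of such a check: the helper name is hypothetical and the choice of -EINVAL as the error is an assumption, not something stated in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical argument check: requeueing from a non-pi futex to a
 * pi futex needs two distinct addresses, so identical ones are
 * rejected up front. Returns 0 if the pair is acceptable. */
static int check_requeue_pi_addrs(const uint32_t *uaddr,
				  const uint32_t *uaddr2)
{
	if (uaddr == uaddr2)
		return -EINVAL;
	return 0;
}
```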
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/ad82bfe7f7d130247fbe2b5b4275654807774227.1342809673.git.dvhart@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Stefan reported a crash on a kernel before a3e5d1091c ("sched:
Don't call task_group() too many times in set_task_rq()"), he
found the reason to be that the multiple task_group()
invocations in set_task_rq() returned different values.
Looking at all that I found a lack of serialization and plain
wrong comments.
The below tries to fix it using an extra pointer which is
updated under the appropriate scheduler locks. It's not pretty,
but I can't really see another way given how all the cgroup
stuff works.
Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current load balance scheme requires only one cpu in a
sched_group (balance_cpu) to look at other peer sched_groups for
imbalance and pull tasks towards itself from a busy cpu. Tasks
thus pulled by balance_cpu could later get picked up by cpus
that are in the same sched_group as that of balance_cpu.
This scheme however fails to pull tasks that are not allowed to
run on balance_cpu (but are allowed to run on other cpus in its
sched_group). That can affect fairness and in some worst case
scenarios cause starvation.
Consider a two core (2 threads/core) system running tasks as
below:
              Core0                   Core1
              /   \                   /   \
             C0    C1                C2    C3
             |     |                 |     |
             v     v                 v     v
             F0    T1                F1    [idle]
                                     T2
F0 = SCHED_FIFO task (pinned to C0)
F1 = SCHED_FIFO task (pinned to C2)
T1 = SCHED_OTHER task (pinned to C1)
T2 = SCHED_OTHER task (pinned to C1 and C2)
F1 could become a cpu hog, which will starve T2 unless C1 pulls
it. Between C0 and C1 however, C0 is required to look for
imbalance between cores, which will fail to pull T2 towards
Core0. T2 will starve eternally in this case. The same scenario
can arise in presence of non-rt tasks as well (say we replace F1
with high irq load).
We tackle this problem by having balance_cpu move pinned tasks
to one of its sibling cpus (where they can run). We first check
if load balance goal can be met by ignoring pinned tasks,
failing which we retry move_tasks() with a new env->dst_cpu.
This patch modifies load balance semantics on who can move load
towards a given cpu in a given sched_domain.
Before this patch, a given_cpu or an ilb_cpu acting on behalf of
an idle given_cpu is responsible for moving load to given_cpu.
With this patch applied, balance_cpu can in addition decide on
moving some load to a given_cpu.
There is a remote possibility that excess load could get moved
as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
*independently* and at *same* time to move some load to a
given_cpu). However we should see less of such conflicting
decisions in practice and moreover subsequent load balance
cycles should correct the excess load moved to given_cpu.
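
The choice of a new destination can be sketched with plain bitmasks. This is
only an illustration under stated assumptions: CPUs are bits in a 64bit mask,
the lowest-numbered eligible sibling wins, and the function name is
hypothetical. The kernel's real selection logic is not described above.

```c
#include <assert.h>
#include <stdint.h>

/* Pick an alternative destination for a task that cannot run on
 * balance_cpu: the lowest-numbered CPU that is both in balance_cpu's
 * sched_group and in the task's allowed mask. Returns -1 if none. */
static int pick_new_dst_cpu(uint64_t group_mask, uint64_t allowed_mask,
			    int balance_cpu)
{
	assert(balance_cpu >= 0 && balance_cpu < 64);

	uint64_t candidates = group_mask & allowed_mask &
			      ~(UINT64_C(1) << balance_cpu);

	for (int cpu = 0; cpu < 64; cpu++)
		if (candidates & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}
```

In the two core example, Core0 is the group {C0, C1} (mask 0x3) and T2 is
allowed on {C1, C2} (mask 0x6), so with balance_cpu C0 the sketch picks C1.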
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While load balancing, if all tasks on the source runqueue are pinned,
we retry after excluding the corresponding source cpu. However, loop counters
env.loop and env.loop_break are not reset before retrying, which can lead
to failure in moving the tasks. In this patch we reset env.loop and
env.loop_break to their initial values before we retry.
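
A minimal sketch of the reset, assuming a cut-down environment struct. The
struct, the helper name and the initial value of 32 are hypothetical; only
the two counter names come from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical initial batch size, for illustration only. */
#define LOOP_BREAK_INIT 32u

struct lb_env_sketch {
	unsigned int loop;       /* tasks examined so far */
	unsigned int loop_break; /* next point at which to take a break */
};

/* Restore both counters before retrying with a different source cpu,
 * so the retry gets a full scan budget instead of a stale one. */
static void lb_env_reset_loop(struct lb_env_sketch *env)
{
	assert(env != NULL);
	env->loop = 0;
	env->loop_break = LOOP_BREAK_INIT;
}
```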
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06EEF.2090709@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Members of 'struct lb_env' are not in appropriate order to reuse compiler
added padding on 64bit architectures. In this patch we reorder those struct
members and help reduce the size of the structure from 96 bytes to 80
bytes on 64 bit architectures.
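
The effect of member order on padding can be seen with a toy example. The
two structs below are hypothetical and unrelated to the real layout of
'struct lb_env'; they only show how interleaving ints and pointers wastes
space on 64bit ABIs.

```c
#include <assert.h>
#include <stddef.h>

/* Pointers interleaved with ints: on a typical LP64 ABI each int is
 * followed by 4 bytes of padding so the next pointer stays aligned. */
struct padded_sketch {
	int   a;
	void *p;
	int   b;
	void *q;
	int   c;
};

/* Same members, widest first: the ints pack next to each other. */
struct packed_sketch {
	void *p;
	void *q;
	int   a;
	int   b;
	int   c;
};

static_assert(sizeof(struct packed_sketch) <= sizeof(struct padded_sketch),
	      "reordering must not grow the struct");

/* Bytes saved by the reordering on the ABI this is compiled for. */
static size_t padding_saved(void)
{
	return sizeof(struct padded_sketch) - sizeof(struct packed_sketch);
}
```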
Suggested-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06DDE.7000403@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Traversing an entire package is not only expensive, it also leads to tasks
bouncing all over a partially idle and possibly quite large package. Fix
that up by assigning a 'buddy' CPU to try to motivate. Each buddy may try
to motivate that one other CPU; if it's busy, tough. It may then try its
SMT sibling, but that's all this optimization is allowed to cost.
Sibling cache buddies are cross-wired to prevent bouncing.
4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:
 clients     1       2       4        8       16       32       64      128
 ..........................................................................
 pre        30      41     118      645     3769     6214    12233    14312
 post      299     603    1211     2418     4697     6847    11606    14557
A nice increase in performance.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpuset_track_online_cpus() is no longer present. So remove the
outdated comment and replace it with a reference to cpuset_update_active_cpus()
which is its equivalent.
Also, we don't lack memory hot-unplug anymore. And David Rientjes pointed
out how it is dealt with. So update that comment as well.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141700.3692.98192.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Separate out the cpuset related handling for CPU/Memory online/offline.
This also helps us exploit the most obvious and basic level of optimization
that any notification mechanism (CPU/Mem online/offline) has to offer us:
"We *know* why we have been invoked. So stop pretending that we are lost,
and do only the necessary amount of processing!".
And while at it, rename scan_for_empty_cpusets() to
scan_cpusets_upon_hotplug(), which is more appropriate considering how
it is restructured.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At present, the functions that deal with cpusets during CPU/Mem hotplug
are quite messy, since a lot of the functionality is mixed up without clear
separation. And this takes a toll on optimization as well. For example,
the function cpuset_update_active_cpus() is called on both CPU offline and CPU
online events; and it invokes scan_for_empty_cpusets(), which makes sense
only for CPU offline events. And hence, the current code ends up unnecessarily
traversing the cpuset tree during CPU online also.
As a first step towards cleaning up those functions, encapsulate the cpuset
tree traversal in a helper function, so as to facilitate upcoming changes.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141635.3692.893.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.
However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.
In order to achieve that, do the following:
1. Don't modify cpusets during suspend/resume. At all.
In particular, don't move the tasks from one cpuset to another, and
don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
during the CPU hotplug operations that are carried out in the
suspend/resume path.
2. However, cpusets and sched domains are related. We just want to avoid
altering cpusets alone. So, to keep the sched domains updated, build
a single sched domain (containing all active cpus) during each of the
CPU hotplug operations carried out in s/r path, effectively ignoring
the cpusets' cpus_allowed masks.
(Since userspace is frozen while doing all this, it will go unnoticed.)
3. During the last CPU online operation during resume, build the sched
domains by looking up the (unaltered) cpusets' cpus_allowed masks.
That will bring back the system to the same original state as it was in
before suspend.
Ultimately, this will not only solve the cpuset problem related to suspend
resume (i.e., restore the cpusets to exactly what they were before suspend, by
not touching them at all) but also speed up suspend/resume because we avoid
running cpuset update code for every CPU being offlined/onlined.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull the big VFS changes from Al Viro:
"This one is *big* and changes quite a few things around VFS. What's in there:
- the first of two really major architecture changes - death to open
intents.
The former is finally there; it was very long in making, but with
Miklos getting through really hard and messy final push in
fs/namei.c, we finally have it. Unlike his variant, this one
doesn't introduce struct opendata; what we have instead is
->atomic_open() taking preallocated struct file * and passing
everything via its fields.
Instead of returning struct file *, it returns -E... on error, 0
on success and 1 in "deal with it yourself" case (e.g. symlink
found on server, etc.).
See comments before fs/namei.c:atomic_open(). That made a lot of
goodies finally possible and quite a few are in that pile:
->lookup(), ->d_revalidate() and ->create() do not get struct
nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
flags instead, ->create() gets "do we want it exclusive" flag.
With the introduction of new helper (kern_path_locked()) we are rid
of all struct nameidata instances outside of fs/namei.c; it's still
visible in namei.h, but not for long. Come the next cycle,
declaration will move either to fs/internal.h or to fs/namei.c
itself. [me, miklos, hch]
- The second major change: behaviour of final fput(). Now we have
__fput() done without any locks held by caller *and* not from deep
in call stack.
That obviously lifts a lot of constraints on the locking in there.
Moreover, it's legal now to call fput() from atomic contexts (which
has immediately simplified life for aio.c). We also don't need
anti-recursion logics in __scm_destroy() anymore.
There is a price, though - the damn thing has become partially
asynchronous. For fput() from normal process we are guaranteed
that pending __fput() will be done before the caller returns to
userland, exits or gets stopped for ptrace.
For kernel threads and atomic contexts it's done via
schedule_work(), so theoretically we might need a way to make sure
it's finished; so far only one such place had been found, but there
might be more.
There's flush_delayed_fput() (do all pending __fput()) and there's
__fput_sync() (fput() analog doing __fput() immediately). I hope
we won't need them often; see warnings in fs/file_table.c for
details. [me, based on task_work series from Oleg merged last
cycle]
- sync series from Jan
- large part of "death to sync_supers()" work from Artem; the only
bits missing here are exofs and ext4 ones. As far as I understand,
those are going via the exofs and ext4 trees resp.; once they are
in, we can put ->write_super() to the rest, along with the thread
calling it.
- preparatory bits from unionmount series (from dhowells).
- assorted cleanups and fixes all over the place, as usual.
This is not the last pile for this cycle; there's at least jlayton's
ESTALE work and fsfreeze series (the latter - in dire need of fixes,
so I'm not sure it'll make the cut this cycle). I'll probably throw
symlink/hardlink restrictions stuff from Kees into the next pile, too.
Plus there's a lot of misc patches I hadn't thrown into that one -
it's large enough as it is..."
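
The three-way return convention described for ->atomic_open() above (-E... on
error, 0 on success, 1 for "deal with it yourself") can be sketched from the
caller's side as follows. The enum and function names are hypothetical; see
the comments before fs/namei.c:atomic_open() for the real contract.

```c
#include <assert.h>
#include <errno.h>

enum open_outcome { OPEN_FAILED, OPEN_DONE, OPEN_FALLBACK };

/* Classify a return value under the quoted convention: negative errno
 * on error, 0 on success, 1 when the caller has to finish the job. */
static enum open_outcome classify_atomic_open(int ret)
{
	assert(ret <= 1);
	if (ret < 0)
		return OPEN_FAILED;
	if (ret == 0)
		return OPEN_DONE;
	return OPEN_FALLBACK;
}
```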
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
switch dentry_open() to struct path, make it grab references itself
spufs: shift dget/mntget towards dentry_open()
zoran: don't bother with struct file * in zoran_map
ecryptfs: don't reinvent the wheels, please - use struct completion
don't expose I_NEW inodes via dentry->d_inode
tidy up namei.c a bit
unobfuscate follow_up() a bit
ext3: pass custom EOF to generic_file_llseek_size()
ext4: use core vfs llseek code for dir seeks
vfs: allow custom EOF in generic_file_llseek code
vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
vfs: Remove unnecessary flushing of block devices
vfs: Make sys_sync writeout also block device inodes
vfs: Create function for iterating over block devices
vfs: Reorder operations during sys_sync
quota: Move quota syncing to ->sync_fs method
quota: Split dquot_quota_sync() to writeback and cache flushing part
vfs: Move noop_backing_dev_info check from sync into writeback
...
* ACPI conversion to PM handling based on struct dev_pm_ops.
* Conversion of a number of platform drivers to PM handling based on struct
dev_pm_ops and removal of empty legacy PM callbacks from a couple of PCI
drivers.
* Suspend-to-both for in-kernel hibernation from Bojan Smojver.
* cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti U Murthy.
* cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
* Suspend and hibernate fixes from Srivatsa S. Bhat and Colin Cross.
* Generic PM domains framework updates.
* RTC CMOS wakeup signaling update from Paul Fox.
* sparse warnings fixes from Sachin Kamat.
* Build warnings fixes for the generic PM domains framework and PM sysfs code.
* sysfs switch for printing device suspend times from Sameer Nanda.
* Documentation fix from Oskar Schirmer.
Merge tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
- ACPI conversion to PM handling based on struct dev_pm_ops.
- Conversion of a number of platform drivers to PM handling based on
struct dev_pm_ops and removal of empty legacy PM callbacks from a
couple of PCI drivers.
- Suspend-to-both for in-kernel hibernation from Bojan Smojver.
- cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti
Murthy.
- cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
- Suspend and hibernate fixes from Srivatsa Bhat and Colin Cross.
- Generic PM domains framework updates.
- RTC CMOS wakeup signaling update from Paul Fox.
- sparse warnings fixes from Sachin Kamat.
- Build warnings fixes for the generic PM domains framework and PM
sysfs code.
- sysfs switch for printing device suspend times from Sameer Nanda.
- Documentation fix from Oskar Schirmer.
* tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (70 commits)
cpufreq: Fix sysfs deadlock with concurrent hotplug/frequency switch
EXYNOS: bugfix on retrieving old_index from freqs.old
PM / Sleep: call early resume handlers when suspend_noirq fails
PM / QoS: Use NULL pointer instead of plain integer in qos.c
PM / QoS: Use NULL pointer instead of plain integer in pm_qos.h
PM / Sleep: Require CAP_BLOCK_SUSPEND to use wake_lock/wake_unlock
PM / Sleep: Add missing static storage class specifiers in main.c
cpuilde / ACPI: remove time from acpi_processor_cx structure
cpuidle / ACPI: remove usage from acpi_processor_cx structure
cpuidle / ACPI : remove latency_ticks from acpi_processor_cx structure
rtc-cmos: report wakeups from interrupt handler
PM / Sleep: Fix build warning in sysfs.c for CONFIG_PM_SLEEP unset
PM / Domains: Fix build warning for CONFIG_PM_RUNTIME unset
olpc-xo15-sci: Use struct dev_pm_ops for power management
PM / Domains: Replace plain integer with NULL pointer in domain.c file
PM / Domains: Add missing static storage class specifier in domain.c file
PM / crypto / ux500: Use struct dev_pm_ops for power management
PM / IPMI: Remove empty legacy PCI PM callbacks
tpm_nsc: Use struct dev_pm_ops for power management
tpm_tis: Use struct dev_pm_ops for power management
...
It doesn't matter on normal return to userland path (we'll recheck the
NOTIFY_RESUME flag anyway), but in case of exit_task_work() we'll
need that as soon as we get callbacks capable of triggering more
task_work_add().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
task_work and rcu_head are identical now; merge them (calling the result
struct callback_head, rcu_head #define'd to it), kill separate allocation
in security/keys since we can just use cred->rcu now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
layout based on Oleg's suggestion; single-linked list,
task->task_works points to the last element, forward pointer
from said last element points to head. I'd still prefer
much more regular scheme with two pointers in task_work,
but...
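
The layout can be sketched in standalone C as below. The type and function
names are hypothetical, and the sketch ignores locking entirely; it only
shows how one pointer to the last element gives access to both ends.

```c
#include <assert.h>
#include <stddef.h>

struct work_sketch {
	struct work_sketch *next;
	int id;
};

/* Append 'w'. '*last' points to the most recently added element and
 * (*last)->next points to the oldest one (the head). */
static void work_add(struct work_sketch **last, struct work_sketch *w)
{
	assert(last != NULL && w != NULL);
	if (*last) {
		w->next = (*last)->next; /* new last element points at head */
		(*last)->next = w;
	} else {
		w->next = w;             /* single element is its own head */
	}
	*last = w;
}

/* Head of the list in insertion order, or NULL when the list is empty. */
static struct work_sketch *work_head(struct work_sketch *last)
{
	return last ? last->next : NULL;
}
```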
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
get rid of the only user of ->data; this is _not_ the final variant - in the
end we'll have task_work and rcu_head identical and just use cred->rcu,
at which point the separate allocation will be gone completely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull timer core changes from Ingo Molnar:
"Continued cleanups of the core time and NTP code, plus more nohz work
preparing for tick-less userspace execution."
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Rework timekeeping functions to take timekeeper ptr as argument
time: Move xtime_nsec adjustment underflow handling timekeeping_adjust
time: Move arch_gettimeoffset() usage into timekeeping_get_ns()
time: Refactor accumulation of nsecs to secs
time: Condense timekeeper.xtime into xtime_sec
time: Explicitly use u32 instead of int for shift values
time: Whitespace cleanups per Ingo's requests
nohz: Move next idle expiry time record into idle logic area
nohz: Move ts->idle_calls incrementation into strict idle logic
nohz: Rename ts->idle_tick to ts->last_tick
nohz: Make nohz API agnostic against idle ticks cputime accounting
nohz: Separate idle sleeping time accounting from nohz logic
timers: Improve get_next_timer_interrupt()
timers: Add accounting of non deferrable timers
timers: Consolidate base->next_timer update
timers: Create detach_if_pending() and use it
Pull perf events changes from Ingo Molnar:
"- kernel side:
- Intel uncore PMU support for Nehalem and Sandy Bridge CPUs, we
support both the events available via the MSR and via the PCI
access space.
- various uprobes cleanups and restructurings
- PMU driver quirks by microcode version and required x86 microcode
loader cleanups/robustization
- various tracing robustness updates
- static keys: remove obsolete static_branch()
- tooling side:
- GTK browser improvements
- perf report browser: support screenshots to file
- more automated tests
- perf kvm improvements
- perf bench refinements
- build environment improvements
- pipe mode improvements
- libtraceevent updates, we have now hopefully merged most bits with
the out of tree forked code base
... and many other goodies."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (138 commits)
tracing: Check for allocation failure in __tracing_open()
perf/x86: Fix intel_perfmon_event_mapformatting
jump label: Remove static_branch()
tracepoint: Use static_key_false(), since static_branch() is deprecated
perf/x86: Uncore filter support for SandyBridge-EP
perf/x86: Detect number of instances of uncore CBox
perf/x86: Fix event constraint for SandyBridge-EP C-Box
perf/x86: Use 0xff as pseudo code for fixed uncore event
perf/x86: Save a few bytes in 'struct x86_pmu'
perf/x86: Add a microcode revision check for SNB-PEBS
perf/x86: Improve debug output in check_hw_exists()
perf/x86/amd: Unify AMD's generic and family 15h pmus
perf/x86: Move Intel specific code to intel_pmu_init()
perf/x86: Rename Intel specific macros
perf/x86: Fix USER/KERNEL tagging of samples
perf tools: Split event symbols arrays to hw and sw parts
perf tools: Split out PE_VALUE_SYM parsing token to SW and HW tokens
perf tools: Add empty rule for new line in event syntax parsing
perf test: Use ARRAY_SIZE in parse events tests
tools lib traceevent: Cleanup realloc use
...
Pull RCU changes from Ingo Molnar:
"Quoting from Paul, the major features of this series are:
1. Preventing latency spikes of more than 200 microseconds for
kernels built with NR_CPUS=4096, which is reportedly becoming the
default for some distros. This is a first step, as it does not
help with systems that actually -have- 4096 CPUs (work on this case
is in progress, but is not yet ready for mainline).
This category also includes improving concurrency of rcu_barrier(),
placed here due to conflicts. Posted to LKML at:
https://lkml.org/lkml/2012/6/22/381
Note that patches 18-22 of that series have been deferred to 3.7, as
they have not yet proven themselves to be mainline-ready (and yes,
these are the ones intended to get rid of RCU's latency spikes for
systems that actually have 4096 CPUs).
2. Updates to documentation and rcutorture fixes, the latter category
including improvements to rcu_barrier() testing. Posted to LKML at
http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/04094.html.
3. Miscellaneous fixes posted to LKML at:
https://lkml.org/lkml/2012/6/22/500
with the exception of the last commit, which was posted here:
http://www.gossamer-threads.com/lists/linux/kernel/1561830
4. RCU_FAST_NO_HZ fixes and improvements. Posted to LKML at:
http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/00006.html
http://www.gossamer-threads.com/lists/linux/kernel/1561833
The first four patches of the first series went into 3.5 to fix a
regression.
5. Code-style fixes. These were posted to LKML at
http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01180.html
http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01181.html"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rcu: Fix broken strings in RCU's source code.
rcu: Fix code-style issues involving "else"
rcu: Introduce check for callback list/count mismatch
rcu: Make RCU_FAST_NO_HZ respect nohz= boot parameter
rcu: Fix qlen_lazy breakage
rcu: Round FAST_NO_HZ lazy timeout to nearest second
rcu: The rcu_needs_cpu() function is not a quiescent state
rcu: Dump only the current CPU's buffers for idle-entry/exit warnings
rcu: Add check for CPUs going offline with callbacks queued
rcu: Disable preemption in rcu_blocking_is_gp()
rcu: Prevent uninitialized string in RCU CPU stall info
rcu: Fix rcu_is_cpu_idle() #ifdef in TINY_RCU
rcu: Split RCU core processing out of __call_rcu()
rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
rcu: Make __call_rcu() handle invocation from idle
rcu: Remove function versions of __kfree_rcu and __is_kfree_rcu_offset
rcu: Consolidate tree/tiny __rcu_read_{,un}lock() implementations
rcu: Remove return value from rcu_assign_pointer()
key: Remove extraneous parentheses from rcu_assign_keypointer()
rcu: Remove return value from RCU_INIT_POINTER()
...
25511a4776 "workqueue: reimplement CPU online rebinding to handle idle
workers" added CPU locality sanity check in process_one_work(). It
triggers if a worker is executing on a different CPU without UNBOUND
or REBIND set.
This works for all normal workers but rescuers can trigger this
spuriously when they're serving the unbound or a disassociated
global_cwq - rescuers don't have either flag set and thus its
gcwq->cpu can be a different value including %WORK_CPU_UNBOUND.
Fix it by additionally testing %GCWQ_DISASSOCIATED.
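
The resulting condition can be sketched as a pure predicate. The flag values
and the function name below are hypothetical stand-ins; only the flag names
mirror the ones mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define SK_WORKER_UNBOUND     (1u << 0)
#define SK_WORKER_REBIND      (1u << 1)
#define SK_GCWQ_DISASSOCIATED (1u << 0)

/* True when a worker running on 'running_cpu' should trip the sanity
 * check: it is bound, its gcwq is still associated with a CPU, and
 * that CPU is not the one it is running on. */
static bool cpu_locality_violated(unsigned int worker_flags,
				  unsigned int gcwq_flags,
				  int running_cpu, int gcwq_cpu)
{
	assert(running_cpu >= 0);
	if (worker_flags & (SK_WORKER_UNBOUND | SK_WORKER_REBIND))
		return false;
	if (gcwq_flags & SK_GCWQ_DISASSOCIATED)
		return false; /* rescuers serving such a gcwq land here */
	return running_cpu != gcwq_cpu;
}
```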
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
LKML-Reference: <20120721213656.GA7783@linux.vnet.ibm.com>
kthread_worker provides minimalistic workqueue-like interface for
users which need a dedicated worker thread (e.g. for realtime
priority). It has basic queue, flush_work, flush_worker operations
which mostly match the workqueue counterparts; however, due to the way
flush_work() is implemented, it has a noticeable difference of not
allowing work items to be freed while being executed.
While the current users of kthread_worker are okay with the current
behavior, the restriction does impede some valid use cases. Also,
removing this difference isn't difficult and actually makes the code
easier to understand.
This patch reimplements flush_kthread_work() such that it uses a
flush_work item instead of queue/done sequence numbers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Make the following two non-functional changes.
* Separate out insert_kthread_work() from queue_kthread_work().
* Relocate struct kthread_flush_work and kthread_flush_work_fn()
definitions above flush_kthread_work().
v2: Added lockdep_assert_held() in insert_kthread_work() as suggested
by Andy Walls.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andy Walls <awalls@md.metrocast.net>
Merge emailed kgdb dmesg fixups patches from Anton Vorontsov:
"The dmesg command appears to be broken after the printk rework. The
old logic in the kdb code makes no sense in terms of current
printk/logging storage format, and KDB simply hangs forever upon
entering 'dmesg' command.
The first patch revives the command by switching to kmsg_dumper
iterator. As a side-effect, the code is now much simpler.
A few changes were needed in the printk.c: we needed unlocked variant
of the kmsg_dumper iterator, but these can surely wait for 3.6.
It's probably too late even for the first patch to go to 3.5, but I'll
try to convince otherwise. :-) Here we go:
- The current code is broken for sure, and has no hope to work at
all. It is a regression
- The new code works for me, and probably works for everyone else;
- If it compiles (and I urge everyone to compile-test it on your
setup), it hardly can make things worse."
* Merge emailed patches from Anton Vorontsov: (4 commits)
kdb: Switch to nolock variants of kmsg_dump functions
printk: Implement some unlocked kmsg_dump functions
printk: Remove kdb_syslog_data
kdb: Revive dmesg command
The locked variants are prone to deadlocks (suppose we got to the
debugger w/ the logbuf lock held), so let's switch to nolock variants.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If used from KDB, the locked variants are prone to deadlocks (suppose we
got to the debugger w/ the logbuf lock held).
So, we have to implement a few routines that grab no logbuf lock.
Yet we don't need these functions in modules, so we don't export them.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function is no longer needed, so remove it.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kgdb dmesg command is broken after the printk rework. The old logic
in kdb code makes no sense in terms of current printk/logging storage
format, and KDB simply hangs forever.
This patch revives the command by switching to kmsg_dumper iterator.
The code is now much simpler and shorter.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In response to an async related regression James noted:
"My theory is that this is an init problem: The assumption in a lot of
our code is that async_synchronize_full() waits for everything ... even
the domain specific async schedules, which isn't true."
...so make this assumption true.
Each domain, including the default one, registers itself on a global domain
list when work is scheduled. Once all entries complete it exits that
list. Waiting for the list to be empty syncs all in-flight work across
all domains.
Domains can opt-out of global syncing if they are declared as exclusive
ASYNC_DOMAIN_EXCLUSIVE(). All stack-based domains have been declared
exclusive since the domain may go out of scope as soon as the last work
item completes.
Statically declared domains are mostly ok, but async_unregister_domain()
is there to close any theoretical races with pending
async_synchronize_full waiters at module removal time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eldad Zack <eldadzack@gmail.com>
Tested-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
This is in preparation for teaching async_synchronize_full() to sync all
pending async work, and not just on the async_running domain. This
conversion is functionally equivalent, just embedding the existing list
in a new async_domain type.
The .registered attribute is used in a later patch to distinguish
between domains that want to be flushed by async_synchronize_full()
versus those that only expect async_synchronize_{full|cookie}_domain to
be used for flushing.
[jejb: add async.h to scsi_priv.h for struct async_domain]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Tested-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
There are tools like makedumpfile and vmcore-dmesg which can extract
kernel log buffer from vmcore. Since we introduced structured logging,
that functionality is broken. Now user space tools need to know about
"struct log" and offsets of various fields to be able to parse struct
log data and extract text message or dictionary.
This patch exports some of the fields.
Currently I am not exporting log "level" info as that is a bitfield and
offsetof() bitfields can't be calculated. But if people start asking for
log level info in the output then we probably either need to separate
out "level" or use bit shift operations for flags and level.
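
To see why the bitfield is a problem, consider the sketch below. The struct
is a hypothetical stand-in for "struct log" (the field list is an assumption)
and the output format of the helper is made up for illustration. offsetof()
works for the ordinary members, while applying it to 'flags' or 'level' is
rejected by the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record header with two bitfields at the end. */
struct log_sketch {
	uint64_t ts_nsec;
	uint16_t len;
	uint16_t text_len;
	uint16_t dict_len;
	unsigned int flags:5;
	unsigned int level:3; /* offsetof() cannot be applied to this */
};

/* Format one exported offset as text; returns the snprintf() result. */
static int format_offset(char *buf, size_t size, const char *field,
			 size_t off)
{
	assert(buf != NULL && field != NULL);
	return snprintf(buf, size, "OFFSET(log_sketch.%s)=%zu", field, off);
}
```

A user space parser that knows these offsets can then walk the records
without needing the kernel's struct definition.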
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a way to have different functions calling different trampolines.
If a ftrace_ops wants regs saved on the return, then have only the
functions with ops registered to save regs. Functions registered by
other ops would not be affected, unless the functions overlap.
If one ftrace_ops registered functions A, B and C and another ops
registered functions to save regs on A, and D, then only functions
A and D would be saving regs. Function B and C would work as normal.
Although A is registered by both ops: normal and saves regs; this is fine
as saving the regs is needed to satisfy one of the ops that calls it
but the regs are ignored by the other ops function.
x86_64 implements the full regs saving, and i386 just passes a NULL
for regs to satisfy the ftrace_ops passing. An arch must supply
both regs and ftrace_ops parameters, even if regs is just NULL.
It is OK for an arch to pass NULL regs. All function trace users that
require regs passing must add the flag FTRACE_OPS_FL_SAVE_REGS when
registering the ftrace_ops. If the arch does not support saving regs
then the ftrace_ops will fail to register. The flag
FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED may be set that will prevent the
ftrace_ops from failing to register. In this case, the handler may
either check if regs is not NULL or check if ARCH_SUPPORTS_FTRACE_SAVE_REGS.
If the arch supports passing regs it will set this macro and pass regs
for ops that request them. All other archs will just pass NULL.
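
The registration rules and the A/B/C/D example above can be modelled roughly as follows. This is a userspace sketch under stated assumptions: the two constants only stand in for FTRACE_OPS_FL_SAVE_REGS and FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED, and the Ops class and both helper functions are hypothetical names.

```python
SAVE_REGS = 1 << 0               # stands in for FTRACE_OPS_FL_SAVE_REGS
SAVE_REGS_IF_SUPPORTED = 1 << 1  # stands in for FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED


class Ops:
    def __init__(self, name, funcs, flags=0):
        self.name = name
        self.funcs = set(funcs)
        self.flags = flags


def register(registry, ops, arch_supports_regs):
    """Return True if ops was registered, False if its request cannot be met."""
    requires_regs = bool(ops.flags & SAVE_REGS)
    tolerant = bool(ops.flags & SAVE_REGS_IF_SUPPORTED)
    if requires_regs and not tolerant and not arch_supports_regs:
        return False
    registry.append(ops)
    return True


def funcs_saving_regs(registry, arch_supports_regs):
    """Functions whose trampoline saves regs: only those some ops asked for."""
    if not arch_supports_regs:
        return set()  # such an arch just passes NULL for regs
    saving = set()
    for ops in registry:
        if ops.flags & (SAVE_REGS | SAVE_REGS_IF_SUPPORTED):
            saving |= ops.funcs
    return saving
```

With one ops on A, B and C and a second, regs-saving ops on A and D, funcs_saving_regs() yields only A and D.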
Link: http://lkml.kernel.org/r/20120711195745.107705970@goodmis.org
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Return as the 4th parameter to the function tracer callback the pt_regs.
Later patches that implement regs passing for the architectures will require
having the ftrace_ops set the SAVE_REGS flag, which will tell the arch
to take the time to pass a full set of pt_regs to the ftrace_ops callback
function. If the arch does not support it then it should pass NULL.
If an arch can pass full regs, then it should define:
ARCH_SUPPORTS_FTRACE_SAVE_REGS to 1
Link: http://lkml.kernel.org/r/20120702201821.019966811@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the function tracer starts to get more features, the support for
these features will spread out throughout the different architectures
over time. These features boil down to what each arch does in the
mcount trampoline (the ftrace_caller).
Currently there are two features that are not the same throughout the
archs.
1) Support to stop function tracing before the callback
2) passing of the ftrace ops
Both of these require placing an indirect function to support the
features if the mcount trampoline does not.
On a side note, for all architectures, when more than one callback
is registered to the function tracer, an intermediate 'list' function
is called by the mcount trampoline to iterate through the callbacks
that are registered.
Instead of making a separate function for each of these features,
and requiring several indirect calls, just use the single 'list' function
as the intermediate, to handle all cases. If an arch does not support
the 'stop function tracing' or the passing of ftrace ops, just force
it to use the list function that will handle the features required.
This makes the code cleaner and simpler and removes a lot of
#ifdefs in the code.
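
The idea of one intermediate function covering every missing feature can be sketched like this. It is a simplified model, not the kernel implementation: make_list_func, the state dictionary and its "stopped" key are hypothetical names chosen for the example.

```python
def make_list_func(callbacks, state):
    """Build the single intermediate function a trampoline would call.

    callbacks is a list of (ops, func) pairs; state["stopped"] models the
    'stop function tracing' check that some trampolines do not do themselves.
    """
    def list_func(ip, parent_ip):
        if state["stopped"]:
            return  # tracing is stopped: no callback runs
        for ops, func in callbacks:
            func(ip, parent_ip, ops)  # pass the ops on behalf of the arch
    return list_func
```

An arch lacking either feature would route even a single callback through list_func, so no second indirection is needed.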
Link: http://lkml.kernel.org/r/20120612225424.495625483@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the function trace callback receives only the ip and parent_ip
of the function that it traced. It would be more powerful to also return
the ops that registered the function as well. This allows the same function
to act differently depending on what ftrace_ops registered it.
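
A minimal sketch of what passing the ops buys: one callback, registered through two different ops, behaves differently because it can consult per-ops data. FtraceOps, its limit field and count_until_limit are hypothetical names used only for illustration.

```python
class FtraceOps:
    """Hypothetical per-registration state handed back to the callback."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.hits = 0


def count_until_limit(ip, parent_ip, ops):
    """Shared callback: counts hits, but only up to the limit of its own ops."""
    if ops.hits < ops.limit:
        ops.hits += 1
```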
Link: http://lkml.kernel.org/r/20120612225424.267254552@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new interrupt sampling system, we are no longer using the
timer_rand_state structure in the irq descriptor, so we can stop
initializing it now.
[ Merged in fixes from Sedat to find some last missing references to
rand_initialize_irq() ]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Commit a7a20d1039 ("sd: limit the scope of the async probe domain")
made the SCSI device probing run device discovery in its own async
domain.
However, as a result, the partition detection was no longer synchronized
by async_synchronize_full() (which, despite the name, only synchronizes
the global async space, not all of them). Which in turn meant that
"wait_for_device_probe()" would not wait for the SCSI partitions to be
parsed.
And "wait_for_device_probe()" was what the boot time init code relied on
for mounting the root filesystem.
Now, most people never noticed this, because not only is it
timing-dependent, but modern distributions all use initrd. So the root
filesystem isn't actually on a disk at all. And then before they
actually mount the final disk filesystem, they will have loaded the
scsi-wait-scan module, which not only does the expected
wait_for_device_probe(), but also does scsi_complete_async_scans().
[ Side note: scsi_complete_async_scans() had also been partially broken,
but that was fixed in commit 43a8d39d01 ("fix async probe
regression"), so that same commit a7a20d1039 had actually broken
setups even if you used scsi-wait-scan explicitly ]
Solve this problem by just moving the scsi_complete_async_scans() call
into wait_for_device_probe(). Everybody who wants to wait for device
probing to finish really wants the SCSI probing to complete, so there's
no reason not to do this.
So now "wait_for_device_probe()" really does what the name implies, and
properly waits for device probing to finish. This also removes the now
unnecessary extra calls to scsi_complete_async_scans().
Reported-and-tested-by: Artem S. Tashkinov <t.artem@mailcity.com>
Cc: Dan Williams <dan.j.williams@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: linux-scsi <linux-scsi@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Require processes wanting to use the wake_lock/wake_unlock sysfs
files to have the CAP_BLOCK_SUSPEND capability, which also is
required for the eventpoll EPOLLWAKEUP flag to be effective, so that
all interfaces related to blocking autosleep depend on the same
capability.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
Acked-by: Michael Kerrisk <mtk.man-pages@gmail.com>
One more time/ntp fix pulled from Ingo Molnar.
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ntp: Fix STA_INS/DEL clearing bug
With trustee gone, CPU hotplug code can be simplified.
* gcwq_claim/release_management() now grab and release gcwq lock too
respectively and gained _and_lock and _and_unlock postfixes.
* All CPU hotplug logic was implemented in workqueue_cpu_callback()
which was called by workqueue_cpu_up/down_callback() for the correct
priority. This was because up and down paths shared a lot of logic,
which is no longer true. Remove workqueue_cpu_callback() and move
all hotplug logic into the two actual callbacks.
This patch doesn't make any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
With the previous changes, a disassociated global_cwq now can run as
an unbound one on its own - it can create workers as necessary to
drain remaining works after the CPU has been brought down and manage
the number of workers using the usual idle timer mechanism making
trustee completely redundant except for the actual unbinding
operation.
This patch removes the trustee and lets a disassociated global_cwq
manage itself. Unbinding is moved to a work item (for CPU affinity)
which is scheduled and flushed from CPU_DOWN_PREPARE.
This patch moves nr_running clearing outside gcwq and manager locks to
simplify the code. As nr_running is unused at the point, this is
safe.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, during CPU offlining, after all pending work items are
drained, the trustee butchers all workers. Also, on CPU onlining
failure, workqueue_cpu_callback() ensures that the first idle worker
is destroyed. Combined, these guarantee that an offline CPU doesn't
have any worker for it once all the lingering work items are finished.
This guarantee isn't really necessary and makes CPU on/offlining more
expensive than it needs to be, especially for platforms which use CPU
hotplug for powersaving.
This patch removes idle worker butchering from the
trustee and lets a CPU which failed onlining keep the created first
worker. The first worker is created if the CPU doesn't have any
during CPU_UP_PREPARE and started right away. If onlining succeeds,
the rebind_workers() call in CPU_ONLINE will rebind it like any other
workers. If onlining fails, the worker is left alone till the next
try.
This makes CPU hotplugs cheaper by allowing global_cwqs to keep
workers across them and simplifies code.
Note that trustee doesn't re-arm idle timer when it's done and thus
the disassociated global_cwq will keep all workers until it comes back
online. This will be improved by further patches.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, if there are workers left when a CPU is being brought back
online, the trustee kills all idle workers and schedules rebind_work
so that they re-bind to the CPU after the currently executing work is
finished. This works for busy workers because concurrency management
doesn't try to wake them up from scheduler callbacks, which require
the target task to be on the local run queue. The busy worker bumps
concurrency counter appropriately as it clears WORKER_UNBOUND from the
rebind work item and it's bound to the CPU before returning to the
idle state.
To reduce CPU on/offlining overhead (as many embedded systems use it
for powersaving) and simplify the code path, workqueue is planned to
be modified to retain idle workers across CPU on/offlining. This
patch reimplements CPU online rebinding such that it can also handle
idle workers.
As noted earlier, due to the local wakeup requirement, rebinding idle
workers is tricky. All idle workers must be re-bound before scheduler
callbacks are enabled. This is achieved by interlocking idle
re-binding. Idle workers are requested to re-bind and then hold until
all idle re-binding is complete so that no bound worker starts
executing work item. Only after all idle workers are re-bound and
parked, CPU_ONLINE proceeds to release them and queue rebind work item
to busy workers thus guaranteeing scheduler callbacks aren't invoked
until all idle workers are ready.
worker_rebind_fn() is renamed to busy_worker_rebind_fn() and
idle_worker_rebind() for idle workers is added. Rebinding logic is
moved to rebind_workers() and now called from CPU_ONLINE after
flushing trustee. While at it, add CPU sanity check in
worker_thread().
Note that now a worker may become idle or the manager between trustee
release and rebinding during CPU_ONLINE. As the previous patch
updated create_worker() so that it can be used by regular manager
while unbound and this patch implements idle re-binding, this is safe.
This prepares for removal of trustee and keeping idle workers across
CPU hotplugs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, create_worker()'s callers are responsible for deciding
whether the newly created worker should be bound to the associated CPU
and create_worker() sets WORKER_UNBOUND only for the workers for the
unbound global_cwq. Creation during normal operation is always via
maybe_create_worker() and @bind is true. For workers created during
hotplug, @bind is false.
Normal operation path is planned to be used even while the CPU is
going through hotplug operations or offline and this static decision
won't work.
Drop @bind from create_worker() and decide whether to bind by looking
at GCWQ_DISASSOCIATED. create_worker() will also set WORKER_UNBOUND
automatically if disassociated. To avoid flipping GCWQ_DISASSOCIATED
while create_worker() is in progress, the flag is now allowed to be
changed only while holding all manager_mutexes on the global_cwq.
This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's
back. CPU_ONLINE no longer clears DISASSOCIATED before flushing
trustee, which clears DISASSOCIATED before rebinding remaining workers
if asked to release. For cases where trustee isn't around, CPU_ONLINE
clears DISASSOCIATED after flushing trustee. Also, now, first_idle
has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE
while binding it. These convolutions will soon be removed by further
simplification of CPU hotplug path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
POOL_MANAGING_WORKERS is used to ensure that at most one worker takes
the manager role at any given time on a given global_cwq. Trustee
later hitched on it to assume the manager role, adding a blocking wait for the
bit. As trustee already needed a custom wait mechanism, waiting for
MANAGING_WORKERS was rolled into the same mechanism.
Trustee is scheduled to be removed. This patch separates out
MANAGING_WORKERS wait into per-pool mutex. Workers use
mutex_trylock() to test for manager role and trustee uses mutex_lock()
to claim manager roles.
gcwq_claim/release_management() helpers are added to grab and release
manager roles of all pools on a global_cwq. gcwq_claim_management()
always grabs pool manager mutexes in ascending pool index order and
uses pool index as lockdep subclass.
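
The locking scheme can be sketched in userspace as follows. This is an illustration only: Pool, claim_management, release_management and try_become_manager are hypothetical names, and threading.Lock stands in for the per-pool mutex.

```python
import threading


class Pool:
    def __init__(self, index):
        self.index = index
        self.manager_mutex = threading.Lock()


def claim_management(pools):
    """Blocking claim of every pool's manager role, in ascending index order.

    Taking the mutexes in one fixed order is what keeps two claimers from
    deadlocking against each other.
    """
    for pool in sorted(pools, key=lambda p: p.index):
        pool.manager_mutex.acquire()


def release_management(pools):
    for pool in pools:
        pool.manager_mutex.release()


def try_become_manager(pool):
    """Worker path: non-blocking attempt, analogous to mutex_trylock()."""
    return pool.manager_mutex.acquire(blocking=False)
```

A worker that fails the trylock simply carries on as a regular worker, while the claiming side sleeps until it holds every mutex.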
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, WORKER_UNBOUND is used to mark workers for the unbound
global_cwq and WORKER_ROGUE is used to mark workers for disassociated
per-cpu global_cwqs. Both are used to make the marked worker skip
concurrency management and the only place they make any difference is
in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling
idle timer, which can easily be replaced with trustee state testing.
This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops
WORKER_ROGUE. This is to prepare for removing trustee and handling
disassociated global_cwqs as unbound.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED.
This was necessary because workqueue's CPU_DOWN_PREPARE happened
before other DOWN_PREPARE notifiers and workqueue needed to stay
associated across the rest of DOWN_PREPARE.
After the previous patch, workqueue's DOWN_PREPARE happens after
others and can set GCWQ_DISASSOCIATED directly. Drop CPU_DYING and
let the trustee set GCWQ_DISASSOCIATED after disabling concurrency
management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, all workqueue cpu hotplug operations run off
CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to
ensure that workqueue is up and running while bringing up a CPU before
other notifiers try to use workqueue on the CPU.
Per-cpu workqueues are supposed to remain working and bound to the CPU
for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even
with workqueue offlining running with higher priority because
workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
runs the per-cpu workqueue without concurrency management without
explicitly detaching the existing workers.
However, if the trustee needs to create new workers, it creates
unbound workers which may wander off to other CPUs while
CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU
down is cancelled, the per-CPU workqueue may end up with workers which
aren't bound to the CPU.
While reliably reproducible with a convoluted artificial test-case
involving scheduling and flushing CPU burning work items from CPU down
notifiers, this isn't very likely to happen in the wild, and, even
when it happens, the effects are likely to be hidden by the following
successful CPU down.
Fix it by using different priorities for up and down notifiers - high
priority for up operations and low priority for down operations.
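
The effect of splitting the priority can be illustrated with a toy notifier chain. The numeric priorities, the event strings and run_chain itself are assumptions made for this sketch; only the ordering idea comes from the description above.

```python
def run_chain(notifiers, event):
    """Return notifier names in call order: highest priority first.

    Each notifier is a (priority, name, handled_events) tuple.
    """
    order = []
    for prio, name, handled in sorted(notifiers, key=lambda n: -n[0]):
        if event in handled:
            order.append(name)
    return order


# Hypothetical priorities: workqueue up runs before, and down after, the rest.
NOTIFIERS = [
    (0, "other", {"UP", "DOWN_PREPARE"}),
    (5, "workqueue_up", {"UP"}),
    (-5, "workqueue_down", {"DOWN_PREPARE"}),
]
```

Here run_chain(NOTIFIERS, "UP") calls the workqueue notifier first, while for "DOWN_PREPARE" it is called last, so per-cpu workqueues stay bound while the other down notifiers run.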
Workqueue cpu hotplug operations will soon go through further cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Since the function accepts just one bit, we can use the switch
construction instead of if/else if/...
Just a cosmetic change, there should be no functional changes.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch introduces 'func_pstore' option, now available in
/sys/kernel/debug/tracing/options when function tracer
is selected.
The patch also adds some tiny code that calls back to pstore
to record the trace. The callback is no-op when PSTORE=n.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If tracer->init() fails, current code will leave current_tracer pointing
to an unusable tracer, which at best makes 'current_tracer' report
inaccurate value.
Fix the issue by pointing current_tracer to nop tracer, and only update
current_tracer with the new one after all the initialization succeeds.
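
The ordering of the fix can be sketched as below. It is a simplified model rather than the kernel code: Tracer, TraceState and set_tracer are hypothetical names, and init() returning 0 or a negative value only mimics the usual kernel convention.

```python
class Tracer:
    def __init__(self, name, init_ok=True):
        self.name = name
        self.init_ok = init_ok

    def init(self):
        # 0 on success, a negative value on failure.
        return 0 if self.init_ok else -1


NOP_TRACER = Tracer("nop")


class TraceState:
    def __init__(self):
        self.current_tracer = NOP_TRACER


def set_tracer(state, new_tracer):
    """Switch tracers without ever exposing one whose init() failed."""
    state.current_tracer = NOP_TRACER      # safe fallback while switching
    ret = new_tracer.init()
    if ret:
        return ret                         # still pointing at the nop tracer
    state.current_tracer = new_tracer      # publish only after success
    return 0
```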
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fragments of continuation lines are flushed to the console immediately. In
case the console is locked, the fragment must be queued up in the cont
buffer.
If the console is busy and the continuation line is complete, but no part
of it was written to the console up to this point, we can just store the
entire line as a regular record and free the buffer earlier.
If the console is busy and earlier messages are already queued up, we
should not flush the fragments of continuation lines, but store them after
the queued up messages, to ensure the proper ordering.
This keeps the console output more readable in case printk()s race against
each other, or we receive over-long continuation lines we need to flush.
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In some cases we are forced to store individual records for a continuation
line print.
Export a flag to allow the external re-construction of the line. The flag
allows us to apply a similar logic externally which is used internally when
the console, /proc/kmsg or the syslog() output is printed.
$ cat /dev/kmsg
4,165,0,-;Free swap = 0kB
4,166,0,-;Total swap = 0kB
6,167,0,c;[
4,168,0,+;0
4,169,0,+;1
4,170,0,+;2
4,171,0,+;3
4,172,0,+;]
6,173,0,-;[0 1 2 3 ]
6,174,0,-;Console: colour VGA+ 80x25
6,175,0,-;console [tty0] enabled
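
An external reader could use the flag roughly as follows to re-assemble lines. This sketch assumes, based only on the sample output above, that 'c' marks the first fragment of a line, '+' marks a record continuing the previous fragment, and '-' marks a standalone record; the function names are hypothetical.

```python
def parse_kmsg_line(line):
    """Split 'level,seq,timestamp,flag;text' into (seq, flag, text)."""
    header, _, text = line.partition(";")
    fields = header.split(",")
    return int(fields[1]), fields[3], text


def join_fragments(lines):
    """Merge 'c' and '+' fragments into whole lines; pass '-' records through."""
    out = []
    open_line = None  # text of a continuation line still being assembled
    for line in lines:
        _seq, flag, text = parse_kmsg_line(line)
        if flag == "c":
            if open_line is not None:
                out.append(open_line)
            open_line = text
        elif flag == "+" and open_line is not None:
            open_line += text
        else:
            if open_line is not None:
                out.append(open_line)
                open_line = None
            out.append(text)
    if open_line is not None:
        out.append(open_line)
    return out
```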
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reserve PREFIX_MAX bytes in the LOG_LINE_MAX line when buffering a
continuation line, to be able to properly prefix the LOG_LINE_MAX
line with the syslog prefix and timestamp when printing it.
Reported-By: Dave Jones <davej@redhat.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The leap second rework unearthed another issue of inconsistent data.
On timekeeping_resume() the timekeeper data is updated, but nothing
calls timekeeping_update(), so now the update code in the timer
interrupt sees stale values.
This has been the case before those changes, but then the timer
interrupt was using stale data as well so this went unnoticed for quite
some time.
Add the missing update call, so all the data is consistent everywhere.
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Reported-and-tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Reported-and-tested-by: Martin Steigerwald <Martin@lichtvoll.de>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Linux PM list <linux-pm@vger.kernel.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As part of cleaning up the timekeeping code, this patch converts
a number of internal functions to take a timekeeper ptr as an
argument, so that the internal functions don't access the global
timekeeper structure directly. This allows for further optimizations
to reduce lock hold time later.
This patch has been updated to include more consistent usage of the
timekeeper value, by making sure it is always passed as an argument
to non top-level functions.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-9-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When we make adjustments speeding up the clock, it's possible
for xtime_nsec to underflow. We already handle this properly,
but we do so from update_wall_time() instead of the more logical
timekeeping_adjust(), where the possible underflow actually
occurs.
Thus, move the correction logic to timekeeping_adjust(), which
is the function that causes the issue. This makes update_wall_time()
more readable.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-8-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since we call arch_gettimeoffset() in all the accessor
functions, move arch_gettimeoffset() calls into
timekeeping_get_ns() and timekeeping_get_ns_raw() to simplify
the code.
This also makes the code easier to maintain as we don't have to
worry about forgetting the arch_gettimeoffset() as has happened
in the past.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-7-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We do the exact same logic moving nsecs to secs in the
timekeeper in multiple places, so condense this into a
single function.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The timekeeper struct has a xtime_nsec, which keeps the
sub-nanosecond remainder. This ends up being somewhat
duplicative of the timekeeper.xtime.tv_nsec value, and we
have to do extra work to keep them apart, copying the full
nsec portion out and back in over and over.
This patch simplifies some of the logic by taking the timekeeper
xtime value and splitting it into timekeeper.xtime_sec and
reuses the timekeeper.xtime_nsec for the sub-second portion
(stored in higher res shifted nanoseconds).
This simplifies some of the accumulation logic and will
allow for more accurate timekeeping once the vsyscall code
is updated to use the shifted nanosecond remainder.
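
The split representation can be sketched as follows: whole seconds in one field and shifted nanoseconds in the other, with one helper carrying overflow into the seconds. This is an illustrative model only; the class, the helper names and the shift value used in any test are assumptions.

```python
NSEC_PER_SEC = 1_000_000_000


class Timekeeper:
    def __init__(self, shift):
        self.shift = shift
        self.xtime_sec = 0
        self.xtime_nsec = 0  # nanoseconds << shift, low bits are sub-ns


def accumulate_nsecs_to_secs(tk):
    """Carry whole seconds out of the shifted nanosecond field."""
    nsec_per_sec_shifted = NSEC_PER_SEC << tk.shift
    while tk.xtime_nsec >= nsec_per_sec_shifted:
        tk.xtime_nsec -= nsec_per_sec_shifted
        tk.xtime_sec += 1


def wall_time(tk):
    """Return (seconds, nanoseconds), dropping the sub-nanosecond bits."""
    return tk.xtime_sec, tk.xtime_nsec >> tk.shift
```

Keeping the remainder in shifted form means nothing is copied out and back in on each accumulation; the shift is applied only when a plain nanosecond value is needed.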
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Ingo noted that using a u32 instead of int for shift values
would be better to make sure the compiler doesn't unnecessarily
use complex signed arithmetic.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Ingo noted a number of places where there is inconsistent
use of whitespace. This patch tries to address the main
culprits.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reason: Update to upstream changes to avoid further conflicts.
Fixup a trivial merge conflict in kernel/time/tick-sched.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In commit 6b43ae8a61, I
introduced a bug that kept the STA_INS or STA_DEL bit
from being cleared from time_status via adjtimex()
without forcing STA_PLL first.
Usually once the STA_INS is set, it isn't cleared
until the leap second is applied, so it's unlikely this
affected anyone. However during testing I noticed it
took some effort to cancel a leap second once STA_INS
was set.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
CC: stable@vger.kernel.org # 3.4
Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We've been moving away from add_interrupt_randomness() for various
reasons: it's too expensive to do on every interrupt, and flooding the
CPU with interrupts could theoretically cause bogus floods of entropy
from a somewhat externally controllable source.
This solves both problems by limiting the actual randomness addition
to just once a second or after 64 interrupts, whichever comes first.
During that time, the interrupt cycle data is buffered up in a per-cpu
pool. Also, we make sure the nonblocking pool used by urandom is
initialized before we start feeding the normal input pool. This
assures that /dev/urandom is returning unpredictable data as soon as
possible.
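
The rate limiting described above can be sketched like this. It follows the wording of this message (64 interrupts or one second), not necessarily the exact constants in the code; FastPool, add_interrupt_sample, the XOR folding and the mix callback are all simplifications assumed for the example.

```python
class FastPool:
    """Hypothetical per-cpu buffer for interrupt timing data."""

    def __init__(self):
        self.count = 0   # samples buffered since the last flush
        self.last = 0.0  # time of the last flush, in seconds
        self.data = 0


def add_interrupt_sample(pool, sample, now, mix):
    """Buffer one sample; flush via mix() after 64 samples or one second.

    Returns True when the buffered data was handed to mix().
    """
    pool.data ^= sample  # stand-in for the real mixing of cycle data
    pool.count += 1
    if pool.count < 64 and now - pool.last < 1.0:
        return False     # cheap path taken on most interrupts
    mix(pool.data)
    pool.count = 0
    pool.last = now
    return True
```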
(Based on an original patch by Linus, but significantly modified by
tytso.)
Tested-by: Eric Wustrow <ewust@umich.edu>
Reported-by: Eric Wustrow <ewust@umich.edu>
Reported-by: Nadia Heninger <nadiah@cs.ucsd.edu>
Reported-by: Zakir Durumeric <zakir@umich.edu>
Reported-by: J. Alex Halderman <jhalderm@umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Pull RCU, perf, and scheduler fixes from Ingo Molnar.
The RCU fix is a revert for an optimization that could cause deadlocks.
One of the scheduler commits (164c33c6ad "sched: Fix fork() error path
to not crash") is correct but not complete (some architectures like Tile
are not covered yet) - the resulting additional fixes are still WIP and
Ingo did not want to delay these pending fixes. See this thread on
lkml:
[PATCH] fork: fix error handling in dup_task()
The perf fixes are just trivial oneliners.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf kvm: Fix segfault with report and mixed guestmount use
perf kvm: Fix regression with guest machine creation
perf script: Fix format regression due to libtraceevent merge
ring-buffer: Fix accounting of entries when removing pages
ring-buffer: Fix crash due to uninitialized new_pages list head
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS/sched: Update scheduler file pattern
sched/nohz: Rewrite and fix load-avg computation -- again
sched: Fix fork() error path to not crash
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called. They could also be passed to the
compare function.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
copy_tree() can theoretically fail in a case other than ENOMEM, but always
returns NULL which is interpreted by callers as -ENOMEM. Change it to return
an explicit error.
Also change clone_mnt() for consistency and because union mounts will add new
error cases.
Thanks to Andreas Gruenbacher <agruen@suse.de> for a bug fix.
[AV: folded braino fix by Dan Carpenter]
Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Valerie Aurora <valerie.aurora@gmail.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
all callers want the same thing, actually - a kinda-sorta analog of
kern_path_create(). I.e. they want parent vfsmount/dentry (with
->i_mutex held, to make sure the child dentry is still their child)
+ the child dentry.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
WQ_HIGHPRI was implemented by queueing highpri work items at the head
of the global worklist. Other than queueing at the head, they weren't
handled differently; unfortunately, this could lead to execution
latency of a few seconds on heavily loaded systems.
Now that workqueue code has been updated to deal with multiple
worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using
a separate worker_pool. NR_WORKER_POOLS is bumped to two and
gcwq->pools[0] is used for normal pri work items and ->pools[1] for
highpri. Highpri workers get -20 nice level and have 'H' suffix in
their names. Note that this change increases the number of kworkers
per cpu.
POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain
wakeup code in process_one_work() are no longer used and removed.
This allows proper prioritization of highpri work items and removes
high execution latency of highpri work items.
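
The difference from head-of-list queueing can be sketched with two separate worklists. This is a toy model: GlobalCwq, queue_work and worker_name are hypothetical, and the exact worker name format shown is an assumption; only the pool indices and the 'H' suffix come from the text above.

```python
NR_WORKER_POOLS = 2


class GlobalCwq:
    def __init__(self, cpu):
        self.cpu = cpu
        # pools[0]: normal priority work items, pools[1]: highpri work items
        self.pools = [[] for _ in range(NR_WORKER_POOLS)]


def queue_work(gcwq, work, highpri=False):
    """Highpri work gets its own worklist instead of the head of a shared one."""
    gcwq.pools[1 if highpri else 0].append(work)


def worker_name(cpu, worker_id, highpri):
    """Name format is assumed for illustration; highpri workers get an 'H'."""
    return "kworker/%d:%d%s" % (cpu, worker_id, "H" if highpri else "")
```

Since highpri items are served by their own workers, they no longer wait behind normal items that are already executing.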
v2: nr_running indexing bug in get_pool_nr_running() fixed.
v3: Refreshed for the get_pool_nr_running() update in the previous
patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josh Hunt <joshhunt00@gmail.com>
LKML-Reference: <CAKA=qzaHqwZ8eqpLNFjxnO2fX-tgAOjmpvxgBFjv6dJeQaOW1w@mail.gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code
paths which need to manipulate all pools in a gcwq to use them.
NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates
over only @gcwq->pool.
Note that nr_running is per-pool property and converted to an array
with NR_WORKER_POOLS elements and renamed to pool_nr_running. Note
that get_pool_nr_running() currently assumes 0 index. The next patch
will make use of non-zero index.
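As a rough illustration, the iteration and the per-pool counter lookup could
look like the userspace sketch below. It is an approximation under stated
assumptions: the pools are modeled as an array member named pools (the patch
itself still uses a single @gcwq->pool), and total_workers() is a
hypothetical helper.

```c
#include <assert.h>

#define NR_WORKER_POOLS	1

/* Simplified stand-ins for the kernel structures. */
struct worker_pool {
	int nr_workers;
};

struct global_cwq {
	struct worker_pool pools[NR_WORKER_POOLS];
};

/* nr_running becomes an array with one slot per pool. */
static int pool_nr_running[NR_WORKER_POOLS];

/* Visit every pool of @gcwq; with NR_WORKER_POOLS == 1 this runs once. */
#define for_each_worker_pool(pool, gcwq)				\
	for ((pool) = &(gcwq)->pools[0];				\
	     (pool) < &(gcwq)->pools[NR_WORKER_POOLS]; (pool)++)

/* Map a pool to its counter by its index inside the gcwq. */
static int *get_pool_nr_running(struct global_cwq *gcwq,
				struct worker_pool *pool)
{
	int idx = (int)(pool - gcwq->pools);

	assert(idx >= 0 && idx < NR_WORKER_POOLS);
	return &pool_nr_running[idx];
}

/* Hypothetical user of the iterator: sum workers over all pools. */
static int total_workers(struct global_cwq *gcwq)
{
	struct worker_pool *pool;
	int total = 0;

	for_each_worker_pool(pool, gcwq)
		total += pool->nr_workers;
	return total;
}
```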
The changes in this patch are mechanical and don't cause any
functional difference. This is to prepare for multiple pools per
gcwq.
v2: nr_running indexing bug in get_pool_nr_running() fixed.
v3: Pointer to array is stupid. Don't use it in get_pool_nr_running()
as suggested by Linus.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Pull the leap second fixes from Thomas Gleixner:
"It's a rather large series, but well discussed, refined and reviewed.
It got a massive testing by John, Prarit and tip.
In theory we could split it into two parts. The first two patches
f55a6faa38: hrtimer: Provide clock_was_set_delayed()
4873fa070a: timekeeping: Fix leapsecond triggered load spike issue
are merely preventing the stuff loops forever issues, which people
have observed.
But there is no point in delaying the other 4 commits which achieve
full correctness into 3.6 as they are tagged for stable anyway. And I
rather prefer to have the full fixes merged in bulk than a "prevent
the observable wreckage and deal with the hidden fallout later"
approach."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Update hrtimer base offsets each hrtimer_interrupt
timekeeping: Provide hrtimer update function
hrtimers: Move lock held region in hrtimer_interrupt()
timekeeping: Maintain ktime_t based offsets for hrtimers
timekeeping: Fix leapsecond triggered load spike issue
hrtimer: Provide clock_was_set_delayed()
GCWQ_MANAGE_WORKERS, GCWQ_MANAGING_WORKERS and GCWQ_HIGHPRI_PENDING
are per-pool properties. Add worker_pool->flags and make the above
three flags per-pool flags.
The changes in this patch are mechanical and don't cause any
functional difference. This is to prepare for multiple pools per
gcwq.
Signed-off-by: Tejun Heo <tj@kernel.org>
Modify all functions which deal with per-pool properties to pass
around @pool instead of @gcwq or @cpu.
The changes in this patch are mechanical and don't cause any
functional difference. This is to prepare for multiple pools per
gcwq.
Signed-off-by: Tejun Heo <tj@kernel.org>
Move worklist and all worker management fields from global_cwq into
the new struct worker_pool. worker_pool points back to the containing
gcwq. worker and cpu_workqueue_struct are updated to point to
worker_pool instead of gcwq too.
This change is mechanical and doesn't introduce any functional
difference other than rearranging of fields and an added level of
indirection in some places. This is to prepare for multiple pools per
gcwq.
v2: Comment typo fixes as suggested by Namhyung.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Unbound wqs aren't concurrency-managed and try to execute work items
as soon as possible. This is currently achieved by implicitly setting
%WQ_HIGHPRI on all unbound workqueues; however, WQ_HIGHPRI
implementation is about to be restructured and this usage won't be
valid anymore.
Add an explicit chain-wakeup path for unbound workqueues in
process_one_work() instead of piggy backing on %WQ_HIGHPRI.
Signed-off-by: Tejun Heo <tj@kernel.org>
Clean up and return -ENOMEM if the kzalloc() fails.
This also prevents a potential crash, as the pointer that failed to
allocate would be later used.
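The general pattern can be sketched in userspace as follows. This is not the
patched function: struct probe_ctx, alloc_zeroed() and the
fail_second_alloc switch are hypothetical, and calloc() stands in for
kzalloc().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical object that needs two zeroed buffers. */
struct probe_ctx {
	int *buf_a;
	int *buf_b;
};

/* Test hook: when set, the second allocation fails. */
static int fail_second_alloc;

/* calloc() stands in for kzalloc() in this userspace model. */
static void *alloc_zeroed(size_t size, int is_second)
{
	if (is_second && fail_second_alloc)
		return NULL;
	return calloc(1, size);
}

static int probe_ctx_init(struct probe_ctx *ctx)
{
	assert(ctx != NULL);

	ctx->buf_a = alloc_zeroed(64, 0);
	if (!ctx->buf_a)
		return -ENOMEM;

	ctx->buf_b = alloc_zeroed(64, 1);
	if (!ctx->buf_b) {
		/* Undo the earlier allocation instead of using a NULL pointer later. */
		free(ctx->buf_a);
		ctx->buf_a = NULL;
		return -ENOMEM;
	}
	return 0;
}
```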
Link: http://lkml.kernel.org/r/20120711063507.GF11812@elgon.mountain
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "no other files mapped" requirement from my previous patch (c/r: prctl:
update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
paranoid: it forbids the operation even if there is just one shared-anon vma
mapped.
Instead, check that the current mm->exe_file is already unmapped; in this case
the exe_file symlink is already outdated and changing it is reasonable.
Plus, this patch fixes the exit code in case the operation succeeds.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The update of the hrtimer base offsets on all cpus cannot be made
atomically from the timekeeper.lock held and interrupt disabled region
as smp function calls are not allowed there.
clock_was_set(), which enforces the update on all cpus, is called
either from preemptible process context in case of do_settimeofday()
or from the softirq context when the offset modification happened in
the timer interrupt itself due to a leap second.
In both cases there is a race window for an hrtimer interrupt between
dropping timekeeper lock, enabling interrupts and clock_was_set()
issuing the updates. Any interrupt which arrives in that window will
see the new time but operate on stale offsets.
So we need to make sure that an hrtimer interrupt always sees a
consistent state of time and offsets.
ktime_get_update_offsets() allows us to get the current monotonic time
and update the per cpu hrtimer base offsets from hrtimer_interrupt()
to capture a consistent state of monotonic time and the offsets. The
function replaces the existing ktime_get() calls in hrtimer_interrupt().
The overhead of the new function vs. ktime_get() is minimal as it just
adds two store operations.
This ensures that any changes to realtime or boottime offsets are
noticed and stored into the per-cpu hrtimer base structures, prior to
any hrtimer expiration and guarantees that timers are not expired early.
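A single-threaded model of the idea, for illustration only. The struct
names, the int64_t nanosecond type and realtime_timer_expired() are
assumptions; the real function must also take its snapshot consistently
with respect to timekeeper updates, which this sketch does not model.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ktime_ns;	/* stand-in for ktime_t, in nanoseconds */

/* Model of the timekeeper state that matters here. */
struct timekeeper_model {
	ktime_ns mono_now;	/* current monotonic time */
	ktime_ns offs_real;	/* monotonic -> realtime offset */
	ktime_ns offs_boot;	/* monotonic -> boottime offset */
};

/* Model of the per-cpu hrtimer base offsets. */
struct hrtimer_cpu_base_model {
	ktime_ns offs_real;
	ktime_ns offs_boot;
};

/* Return monotonic time and refresh both offsets in one step. */
static ktime_ns get_update_offsets(const struct timekeeper_model *tk,
				   ktime_ns *offs_real, ktime_ns *offs_boot)
{
	*offs_real = tk->offs_real;	/* the two extra stores */
	*offs_boot = tk->offs_boot;
	return tk->mono_now;
}

/* The interrupt path refreshes the per-cpu offsets before expiring timers. */
static ktime_ns hrtimer_interrupt_model(const struct timekeeper_model *tk,
					struct hrtimer_cpu_base_model *base)
{
	return get_update_offsets(tk, &base->offs_real, &base->offs_boot);
}

/* A CLOCK_REALTIME timer expires once monotonic + offset reaches it. */
static int realtime_timer_expired(const struct hrtimer_cpu_base_model *base,
				  ktime_ns mono_now, ktime_ns expires_real)
{
	return mono_now + base->offs_real >= expires_real;
}
```

With a stale offset in the per-cpu base, the expiry check can fire early;
after the refresh it uses the same offset the timekeeper holds.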
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We need to update the base offsets from this code and we need to do
that under base->lock. Move the lock held region around the
ktime_get() calls. The ktime_get() calls are going to be replaced with
a function which gets the time and the offsets atomically.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We need to update the hrtimer clock offsets from the hrtimer interrupt
context. To avoid conversions from timespec to ktime_t maintain a
ktime_t based representation of those offsets in the timekeeper. This
puts the conversion overhead into the code which updates the
underlying offsets and provides fast accessible values in the hrtimer
interrupt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The timekeeping code misses an update of the hrtimer subsystem after a
leap second happened. Due to that timers based on CLOCK_REALTIME are
either expiring a second early or late depending on whether a leap
second has been inserted or deleted until an operation is initiated
which causes that update. Unless the update happens by some other
means this discrepancy between the timekeeping and the hrtimer data
stays forever and timers are expired either early or late.
The reported immediate workaround - $ date -s "`date`" - is causing a
call to clock_was_set() which updates the hrtimer data structures.
See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix
Add the missing clock_was_set() call to update_wall_time() in case of
a leap second event. The actual update is deferred to softirq context
as the necessary smp function call cannot be invoked from hard
interrupt context.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
clock_was_set() cannot be called from hard interrupt context because
it calls on_each_cpu().
For fixing the widely reported leap seconds issue it is necessary to
call it from hard interrupt context, i.e. the timer tick code, which
does the timekeeping updates.
Provide a new function which denotes it in the hrtimer cpu base
structure of the cpu on which it is called and raise the hrtimer
softirq. We then execute the clock_was_set() notification from
softirq context in run_hrtimer_softirq(). The hrtimer softirq is
rarely used, so polling the flag there is not a performance issue.
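The deferral can be modeled in userspace roughly as below. All names carry
a _model suffix to mark them as hypothetical; the flag and the
softirq_raised field are assumptions standing in for the per-cpu base state
and the raised softirq.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the relevant per-cpu hrtimer base state. */
struct cpu_base_model {
	bool clock_was_set_pending;	/* noted from hard interrupt context */
	bool softirq_raised;		/* stands in for raising the softirq */
};

/* Counts how often the expensive cross-cpu notification ran. */
static int clock_was_set_calls;

static void clock_was_set_model(void)
{
	clock_was_set_calls++;
}

/* Hard interrupt context: only note the event and raise the softirq. */
static void clock_was_set_delayed_model(struct cpu_base_model *base)
{
	base->clock_was_set_pending = true;
	base->softirq_raised = true;
}

/* Softirq context: poll the flag and run the notification if it is set. */
static void run_hrtimer_softirq_model(struct cpu_base_model *base)
{
	base->softirq_raised = false;
	if (base->clock_was_set_pending) {
		base->clock_was_set_pending = false;
		clock_was_set_model();
	}
}
```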
[ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get
rid of all this ifdeffery ASAP ]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Here are some more printk fixes for 3.5-rc6. They resolve all known
outstanding issues with the printk changes that have been happening. They have
been tested by the people reporting the problems.
This hopefully should be it for the printk stuff for 3.5-final.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'driver-core-3.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull printk fixes from Greg Kroah-Hartman:
"Here are some more printk fixes for 3.5-rc6. They resolve all known
outstanding issues with the printk changes that have been happening.
They have been tested by the people reporting the problems.
This hopefully should be it for the printk stuff for 3.5-final.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kmsg: merge continuation records while printing
kmsg: /proc/kmsg - support reading of partial log records
kmsg: make sure all messages reach a newly registered boot console
kmsg: properly handle concurrent non-blocking read() from /proc/kmsg
kmsg: add the facility number to the syslog prefix
kmsg: escape the backslash character while exporting data
printk: replacing the raw_spin_lock/unlock with raw_spin_lock/unlock_irq
irq_create_direct_mapping can only be used with the NOMAP type. Make
the function test to ensure it is passed the correct type of
irq_domain.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
In preparation to remove the slow revmap path, eliminate the public
radix revmap lookup functions. This simplifies the code and makes the
slowpath removal patch a lot simpler.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
This adds a new strict mapping API for supporting creation of linux IRQs
at existing positions within the domain. The new routines are as follows:
For dynamic allocation and insertion to specified ranges:
- irq_create_identity_mapping()
- irq_create_strict_mappings()
These will allocate and associate a range of linux IRQs at the specified
location. This can be used by controllers that have their own static linux IRQ
definitions to map a hwirq range to, as well as for platforms that wish to
establish 1:1 identity mapping between linux and hwirq space.
For insertion to specified ranges by platforms that do their own irq_desc
management:
- irq_domain_associate()
- irq_domain_associate_many()
These in turn call back in to the domain's ->map() routine, for further
processing by the platform. Disassociation of IRQs gets handled through
irq_dispose_mapping() as normal.
With these in place it should be possible to begin migration of legacy IRQ
domains to linear ones, without requiring special handling for static vs
dynamic IRQ definitions in DT vs non-DT paths. This also makes it possible
for domains with static mappings to adopt whichever tree model best fits
their needs, rather than simply restricting them to linear revmaps.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
[grant.likely: Reorganized irq_domain_associate{,_many} to have all logic in one place]
[grant.likely: Add error checking for unallocated irq_descs at associate time]
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
At irq_setup_virq() time all of the data needed to update the reverse
map is available, but the current code ignores it and relies upon the
slow path to insert revmap records. This patch adds revmap updating
to the setup path so the slow path will no longer be necessary.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
This patch moves the irq disassociation code out into a separate
function in preparation to extend irq_setup_virq to handle multiple
irqs and rename it for use by interrupt controller drivers. The new
function will be used by irq_setup_virq() in its error path.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
The revmap type should be linear for irq_domain_add_linear function.
Signed-off-by: Dong Aisheng <dong.aisheng@linaro.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
A large proportion of interrupt controllers that support legacy mappings
do so because non-DT systems need to use fixed IRQ numbers when registering
devices via buses but can otherwise use a linear mapping. The interrupt
controller itself typically is not affected by the mapping used and best
practice is to use a linear mapping where possible so drivers frequently
select at runtime depending on if a legacy range has been allocated to
them.
Standardise this behaviour by providing irq_domain_register_simple() which
will allocate a linear mapping unless a positive first_irq is provided in
which case it will fall back to a legacy mapping. This helps make best
practice for irq_domain adoption clearer.
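The selection rule can be sketched as a small userspace model. The enum,
struct irq_domain_model and register_simple_model() are hypothetical and do
not mirror the real signature of the new function.

```c
#include <assert.h>

enum revmap_type {
	REVMAP_LINEAR,
	REVMAP_LEGACY,
};

/* Simplified stand-in for an irq_domain. */
struct irq_domain_model {
	enum revmap_type type;
	unsigned int size;	/* number of hwirqs covered */
	unsigned int first_irq;	/* first Linux IRQ, legacy case only */
};

/* Linear by default; legacy only when a positive first_irq is given. */
static struct irq_domain_model register_simple_model(unsigned int size,
						     int first_irq)
{
	struct irq_domain_model d = { REVMAP_LINEAR, size, 0 };

	assert(size > 0);
	if (first_irq > 0) {
		d.type = REVMAP_LEGACY;
		d.first_irq = (unsigned int)first_irq;
	}
	return d;
}
```

A driver would then pass whatever fixed IRQ base its platform data provides,
or 0 when none was allocated, instead of choosing the mapping itself.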
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
In (the unlikely) case our continuation merge buffer is busy, we unfortunately
can not merge further continuation printk()s into a single record and have to
store them separately, which leads to split-up output of these lines when they
are printed.
Add some flags about newlines and prefix existence to these records and try to
reconstruct the full line again, when the separated records are printed.
Reported-By: Michael Neuling <mikey@neuling.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Tested-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While refactoring cgroup file removal path, 05ef1d7c4a "cgroup:
introduce struct cfent" incorrectly changed the @dir argument of
simple_unlink() to the inode of the file being deleted instead of that
of the containing directory.
The effect of this bug is minor - ctime and mtime of the parent
weren't properly updated on file deletion.
Fix it by using @cgrp->dentry->d_inode instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
Restore support for partial reads of any size on /proc/kmsg, in case the
supplied read buffer is smaller than the record size.
Some people seem to think it is a good idea to run:
$ dd if=/proc/kmsg bs=1 of=...
as a klog bridge.
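A userspace sketch of partial-record reads, assuming a reader that remembers
how much of the current record it has already handed out. struct
record_reader and read_partial() are hypothetical names, not the kernel
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reader state: the formatted record plus a partial-read offset. */
struct record_reader {
	const char *record;	/* text of the current record */
	size_t len;		/* total length of the record */
	size_t consumed;	/* bytes already returned to the caller */
};

/* Copy at most @size bytes and continue where the last read stopped. */
static size_t read_partial(struct record_reader *r, char *buf, size_t size)
{
	size_t left, n;

	assert(r->consumed <= r->len);
	left = r->len - r->consumed;
	n = size < left ? size : left;

	memcpy(buf, r->record + r->consumed, n);
	r->consumed += n;
	return n;
}
```

Even a one-byte buffer, as with bs=1, eventually receives the whole record.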
Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44211
Reported-by: Jukka Ollila <jiiksteri@gmail.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.
Destroying a superblock which has dentries with positive refcnts is a
critical bug and triggers BUG() in vfs code. As each cgroup dentry
holds an s_active reference, any lingering cgroup has both its dentry
and the superblock pinned, thus preventing premature release of the
superblock.
Unfortunately, after 48ddbe1946, there's a small window while
releasing a cgroup which is directly under the root of the hierarchy.
When a cgroup directory is released, vfs layer first deletes the
corresponding dentry and then invokes dput() on the parent, which may
recurse further, so when a cgroup directly below root cgroup is
released, the cgroup is first destroyed - which releases the s_active
it was holding - and then the dentry for the root cgroup is dput().
This creates a window where the root dentry's refcnt isn't zero but
superblock's s_active is. If umount happens before or during this
window, vfs will see the root dentry with non-zero refcnt and trigger
BUG().
Before 48ddbe1946, this problem didn't exist because the last dentry
reference was guaranteed to be put synchronously from rmdir(2)
invocation which holds s_active around the whole process.
Fix it by holding an extra superblock->s_active reference across
dput() from css release, which is the dput() path added by 48ddbe1946
and the only one which doesn't hold an extra s_active ref across the
final cgroup dput().
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: shyju pv <shyju.pv@huawei.com>
Tested-by: shyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
This reverts commit fa980ca87d. The
commit was an attempt to fix a race condition where a cgroup hierarchy
may be unmounted with positive dentry reference on root cgroup. While
the commit made the race condition slightly more difficult to trigger,
the race was still there and could be reliably triggered using a
different test case.
Revert the incorrect fix. The next commit will describe the race and
fix it correctly.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: shyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
We suppress printing kmsg records to the console, which are already printed
immediately while we have received their fragments.
Newly registered boot consoles print the entire kmsg buffer during
registration. Clear the console-suppress flag after we skipped the record
during its first storage, so any later print will see these records as usual.
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The /proc/kmsg read() interface is internally simply wired up to a sequence
of syslog() syscalls, which might be racy between their checks and actions,
regarding concurrency.
In the (very uncommon) case of concurrent readers of /proc/kmsg, relying on
usual O_NONBLOCK behavior, the recently introduced mutex might block an
O_NONBLOCK reader in read(), when poll() returns for it, but another process
has already read the data in the meantime. We've seen that while running
artificial test setups and tools that "fight" about /proc/kmsg data.
This restores the original /proc/kmsg behavior, where in case of concurrent
read()s, poll() might wake up but the read() syscall will just return 0 to
the caller, while another process has "stolen" the data.
This is in the general case not the expected behavior, but it is the exact
same one, that can easily be triggered with a 3.4 kernel, and some tools
might just rely on it.
The mutex is not needed, the original integrity issue which introduced it,
is in the meantime covered by:
"fill buffer with more than a single message for SYSLOG_ACTION_READ"
116e90b23f
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After the recent split of facility and level into separate variables,
we miss the facility value (always 0 for kernel-originated messages)
in the syslog prefix.
On Tue, Jul 3, 2012 at 12:45 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> Static checkers complain about the impossible condition here.
>
> In 084681d14e ('printk: flush continuation lines immediately to
> console'), we changed msg->level from being a u16 to being an unsigned
> 3 bit bitfield.
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Non-printable characters in the log data are hex-escaped to ensure safe
post processing. We need to escape a backslash we find in the data, to be
able to distinguish it from a backslash we add for the escaping.
Also escape the non-printable character 127.
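The escaping rule can be sketched in userspace as below. escape_log_text()
is a hypothetical helper; treating all bytes from 127 upward as
non-printable is an assumption of this sketch, as is the \xNN output form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hex-escape control characters, bytes >= 127 and the backslash itself,
 * so a literal backslash in the data cannot be confused with an escape.
 * Returns the number of bytes written, not counting the trailing NUL.
 */
static size_t escape_log_text(const char *in, size_t in_len,
			      char *out, size_t out_size)
{
	size_t o = 0;
	size_t i;

	assert(out_size > 0);
	for (i = 0; i < in_len; i++) {
		unsigned char c = (unsigned char)in[i];

		if (c < ' ' || c >= 127 || c == '\\') {
			if (o + 4 >= out_size)	/* need 4 bytes plus NUL */
				break;
			o += (size_t)snprintf(out + o, out_size - o,
					      "\\x%02x", c);
		} else {
			if (o + 1 >= out_size)	/* need 1 byte plus NUL */
				break;
			out[o++] = (char)c;
		}
	}
	out[o] = '\0';
	return o;
}
```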
Thanks to Miloslav Trmac for the heads up.
Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In devkmsg_read/writev/llseek/poll/open() and related functions,
raw_spin_lock/unlock is used, which can lead to a deadlock.
CPU1: thread1 doing the cat /dev/kmsg:
raw_spin_lock(&logbuf_lock);
while (user->seq == log_next_seq) {
If an interrupt arrives on CPU1 while thread1 is running here, and the
interrupt handler calls printk(), which also needs the logbuf_lock spinlock,
it will cause a deadlock.
So we should use raw_spin_lock/unlock_irq here.
Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Although the C language allows you to break strings across lines, doing
this makes it hard for people to find the Linux kernel code corresponding
to a given console message. This commit therefore fixes broken strings
throughout RCU's source code.
Suggested-by: Josh Triplett <josh@joshtriplett.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The Linux kernel coding style says that single-statement blocks should
omit curly braces unless the other leg of the "if" statement has
multiple statements, in which case the curly braces should be included.
This commit fixes RCU's violations of this rule.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
bigrtm: First steps towards getting RCU out of the way of
tens-of-microseconds real-time response on systems compiled
with NR_CPUS=4096. Also cleanups for and increased concurrency
of rcu_barrier() family of primitives.
doctorture: rcutorture and documentation improvements.
fixes: Miscellaneous fixes.
fnh: RCU_FAST_NO_HZ fixes and improvements.
The recent bug that introduced the RCU callback list/count mismatch
showed the need for a diagnostic to check for this, which this commit
adds.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The pattern (np ? np->full_name : "<none>") is rather common in the
kernel, but can also make for quite long lines. This patch adds a new
inline function, of_node_full_name() so that the test for a valid node
pointer doesn't need to be open coded at all call sites.
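A self-contained sketch of such a helper, with struct device_node reduced to
the one field used here. The body follows the pattern quoted above; the
exact placement and signature in the real headers are not shown in this
message.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the kernel's struct device_node. */
struct device_node {
	const char *full_name;
};

/* Return the node's full name, or "<none>" for a NULL node pointer. */
static inline const char *of_node_full_name(const struct device_node *np)
{
	return np ? np->full_name : "<none>";
}
```

Call sites can then pass of_node_full_name(np) straight to a printk-style
format without repeating the NULL test.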
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Thanks to Charles Wang for spotting the defects in the current code:
- If we go idle during the sample window -- after sampling, we get a
negative bias because we can negate our own sample.
- If we wake up during the sample window we get a positive bias
because we push the sample to a known active period.
So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.
Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
Reported-and-tested-by: Charles Wang <muming.wq@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In dup_task_struct(), if arch_dup_task_struct() fails, the clean up
code fails to clean up correctly. That's because the clean up
code depends on the uninitialized ti->task pointer. We fix this
by making sure that the task and thread_info know about each other
before we attempt to take the error path.
Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120626011815.11323.5533.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull block bits from Jens Axboe:
"As vacation is coming up, thought I'd better get rid of my pending
changes in my for-linus branch for this iteration. It contains:
- Two patches for mtip32xx. Killing a non-compliant sysfs interface
and moving it to debugfs, where it belongs.
- A few patches from Asias. Two legit bug fixes, and one killing an
interface that is no longer in use.
- A patch from Jan, making the annoying partition ioctl warning a bit
less annoying, by restricting it to !CAP_SYS_RAWIO only.
- Three bug fixes for drbd from Lars Ellenberg.
- A fix for an old regression for umem, it hasn't really worked since
the plugging scheme was changed in 3.0.
- A few fixes from Tejun.
- A splice fix from Eric Dumazet, fixing an issue with pipe
resizing."
* 'for-linus' of git://git.kernel.dk/linux-block:
scsi: Silence unnecessary warnings about ioctl to partition
block: Drop dead function blk_abort_queue()
block: Mitigate lock unbalance caused by lock switching
block: Avoid missed wakeup in request waitqueue
umem: fix up unplugging
splice: fix racy pipe->buffers uses
drbd: fix null pointer dereference with on-congestion policy when diskless
drbd: fix list corruption by failing but already aborted reads
drbd: fix access of unallocated pages and kernel panic
xen/blkfront: Add WARN to deal with misbehaving backends.
blkcg: drop local variable @q from blkg_destroy()
mtip32xx: Create debugfs entries for troubleshooting
mtip32xx: Remove 'registers' and 'flags' from sysfs
blkcg: fix blkg_alloc() failure path
block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
block: fix return value on cfq_init() failure
mtip32xx: Remove version.h header file inclusion
xen/blkback: Copy id field when doing BLKIF_DISCARD.
If the nohz= boot parameter disables nohz, then RCU_FAST_NO_HZ needs to
also disable itself. This commit therefore checks for tick_nohz_enabled
being zero, disabling rcu_prepare_for_idle() if so. This commit assumes
that tick_nohz_enabled can change at runtime: If this is not the case,
then a simpler approach suffices.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, if several CPUs in the same package have all lazy RCU
callbacks, their wakeups will be uncorrelated. If all the CPUs are in the
same power domain (as is often the case), this will result in unnecessary
power-ups of the package. This commit therefore uses round_jiffies()
to round the timeouts to a second boundary, increasing the odds that
they can be coalesced with each other or with other timeouts.
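The effect of rounding can be shown with a simplified userspace model. Note
that round_up_to_second() is a hypothetical helper that always rounds up to
the next multiple of HZ; the kernel's round_jiffies() has additional
behavior that is not modeled here, and the HZ value is illustrative.

```c
#include <assert.h>

#define HZ 250UL	/* illustrative tick rate */

/* Round a jiffies value up to the next whole-second boundary. */
static unsigned long round_up_to_second(unsigned long j)
{
	unsigned long rem = j % HZ;

	return rem ? j + (HZ - rem) : j;
}
```

Two timeouts that fall within the same second end up on the same jiffy, so
a single wakeup can serve both.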
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The TINY_PREEMPT_RCU function rcu_preempt_needs_cpu(), which is called
from rcu_needs_cpu(), assumes that it is in a quiescent state with respect
to the CPU. This is no longer the case. This commit therefore updates
rcu_preempt_needs_cpu() to make it aware that it is not running in a
quiescent state.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Problems in RCU idle entry and exit are almost always confined to the
offending CPU. This commit therefore switches ftrace_dump() from
DUMP_ALL to DUMP_ORIG.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
If a CPU goes offline with callbacks queued, those callbacks might be
indefinitely postponed, which can result in a system hang. This commit
therefore inserts warnings for this condition.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is time to optimize CONFIG_TREE_PREEMPT_RCU's synchronize_rcu()
for uniprocessor operation, which means that rcu_blocking_is_gp()
can no longer rely on RCU read-side critical sections having disabled
preemption. This commit therefore disables preemption across
rcu_blocking_is_gp()'s scan of the cpu_online_mask.
(Updated from previous version to fix embarrassing bug spotted by
Wu Fengguang.)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
An uninitialized string may be displayed at the end of the rcu_preempt
detected stall info such as
    0: (1 GPs behind) idle=075/140000000000000/0 =8?^D=8?^D
                                                 ^^^^^^^^^^
if CONFIG_RCU_FAST_NO_HZ is not defined.
This trivial patch clears the string in this case.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_is_cpu_idle() function is used if CONFIG_DEBUG_LOCK_ALLOC,
but TINY_RCU defines it only when CONFIG_PROVE_RCU. This causes
build failures when CONFIG_DEBUG_LOCK_ALLOC=y but CONFIG_PROVE_RCU=n.
This commit therefore adjusts the #ifdefs for rcu_is_cpu_idle() so
that it is defined when CONFIG_DEBUG_LOCK_ALLOC=y.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The __call_rcu() function is a bit overweight, so this commit splits
it into actual enqueuing of and accounting for the callback (__call_rcu())
and associated RCU-core processing (__call_rcu_core()).
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The __call_rcu() function will invoke the RCU core, for example, if
it detects that the current CPU has too many callbacks. However, this
can happen on an offline CPU that is on its way to the idle loop, in
which case it is an error to invoke the RCU core, and the excess callbacks
will be adopted in any case. This commit therefore adds checks to
__call_rcu() for running on an offline CPU, refraining from invoking
the RCU core in this case.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Although __call_rcu() is handled correctly when called from a momentary
non-idle period, if it is called on a CPU that RCU believes to be idle
on RCU_FAST_NO_HZ kernels, the callback might be indefinitely postponed.
This commit therefore ensures that RCU is aware of the new callback and
has a chance to force the CPU out of dyntick-idle mode when a new callback
is posted.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The CONFIG_TREE_PREEMPT_RCU and CONFIG_TINY_PREEMPT_RCU versions of
__rcu_read_lock() and __rcu_read_unlock() are identical, so this commit
consolidates them into kernel/rcupdate.h.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The _rcu_barrier() function accesses other CPUs' rcu_data structure's
->qlen field without benefit of locking. This commit therefore adds
the required ACCESS_ONCE() wrappers around accesses and updates that
need it.
ACCESS_ONCE() is not needed when a CPU accesses its own ->qlen, or
in code that cannot run while _rcu_barrier() is sampling ->qlen fields.
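A minimal userspace sketch of the pattern follows. The ACCESS_ONCE() macro below mirrors the volatile-cast definition historically found in the kernel's compiler.h (written here with the GNU __typeof__ extension); the structure and helper names are hypothetical and stand in for the real rcu_data handling.

```c
/* Force exactly one access to x by going through a volatile-qualified lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the per-CPU rcu_data structure. */
struct sketch_rcu_data {
	long qlen;	/* number of queued callbacks */
};

/* Owner-side update: the store must not be torn or repeated by the compiler. */
static void sketch_enqueue(struct sketch_rcu_data *rdp)
{
	ACCESS_ONCE(rdp->qlen) = rdp->qlen + 1;
}

/* Lockless sample of another CPU's ->qlen, as a barrier-style scan would do. */
static long sketch_sample_qlen(struct sketch_rcu_data *rdp)
{
	return ACCESS_ONCE(rdp->qlen);
}
```

The owner's plain read of rdp->qlen inside sketch_enqueue() is left unwrapped, matching the note above that a CPU reading its own ->qlen does not need ACCESS_ONCE().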
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
There are a couple of open-coded initializations of the rcu_data
structure's RCU callback list. This commit therefore consolidates
them into a new init_callback_list() function.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The code that attempts to identify stalls that end just as we detect
them is broken by both flavors of initialization failure. This commit
therefore properly initializes and computes the count of the number
of reasons why the RCU grace period is stalled.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The current rcutorture rcu_barrier() testing never intentionally runs
more than one instance of rcu_barrier() at a given time. This fails
to test the shiny new concurrency features of rcu_barrier(). This
commit therefore modifies the rcutorture fakewriter kthread to randomly
invoke rcu_barrier() rather than the usual synchronize_rcu().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_torture_barrier() function has a copy-and-paste typo in the
string passed to rcutorture_shutdown_absorb(), which this commit fixes.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The child threads in the rcu_torture_barrier_cbs() are improperly
synchronized, which can cause the rcu_barrier() tests to hang. The
failure mode is as follows:
1.	CPU 0 running in rcu_torture_barrier() sets barrier_cbs_count
	to n_barrier_cbs.
2.	CPU 1 running in rcu_torture_barrier_cbs() wakes up, posts
	its RCU callback, and atomically decrements barrier_cbs_count.
	Because barrier_cbs_count is not zero, it does not do the wake_up().
3.	CPU 2 running in rcu_torture_barrier_cbs() wakes up, but
	finds that barrier_cbs_count is not equal to n_barrier_cbs,
	and so returns to sleep.
4.	The value of barrier_cbs_count therefore never reaches zero,
	which causes the test to hang.
This commit therefore uses a phase variable to coordinate the test,
preventing this scenario from occurring.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
SRCU now has a call_srcu() and an srcu_barrier(), but rcutorture does not
test them. This commit adds the machinery to allow rcutorture's existing
tests for call_rcu() and rcu_barrier() to apply to the SRCU equivalents.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Move the raw SRCU interfaces out of the middle of the normal SRCU
interfaces.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Before RCU had unified idle, the RCU_SAVE_DYNTICK leg of the switch
statement in force_quiescent_state() was dead code for CONFIG_NO_HZ=n
kernel builds. With unified idle, the code is never dead. This commit
therefore removes the "if" statement designed to make gcc aware of when
the code was and was not dead.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit applies the new for_each_rcu_flavor() macro to the
kernel/rcutree_trace.c file.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The arrival of TREE_PREEMPT_RCU some years back included some ugly
code involving either #ifdef or #ifdef'ed wrapper functions to iterate
over all non-SRCU flavors of RCU. This commit therefore introduces
a for_each_rcu_flavor() iterator over the rcu_state structures for each
flavor of RCU to clean up a bit of the ugliness.
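The idea of such an iterator can be sketched in userspace as follows. This is only an illustration under assumptions: the structure, the singly linked "next" pointer, and the sketch_* names are hypothetical and do not reproduce the kernel's rcu_state layout or the actual macro.

```c
#include <stddef.h>

/* Hypothetical stand-in for rcu_state: one instance per RCU flavor. */
struct sketch_rcu_state {
	const char *name;
	struct sketch_rcu_state *next;	/* next flavor, NULL at the end */
};

/* Visit every flavor on the chain starting at head, with no #ifdef at call sites. */
#define sketch_for_each_rcu_flavor(rsp, head) \
	for ((rsp) = (head); (rsp) != NULL; (rsp) = (rsp)->next)

/* Example caller: count the flavors that are linked in. */
static int sketch_count_flavors(struct sketch_rcu_state *head)
{
	struct sketch_rcu_state *rsp;
	int n = 0;

	sketch_for_each_rcu_flavor(rsp, head)
		n++;
	return n;
}
```

A caller written this way does not change when a flavor is configured in or out; only the chain that the iterator walks changes.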
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
With the advent of __this_cpu_ptr(), it is no longer necessary to pass
both the rcu_state and rcu_data structures into __rcu_process_callbacks().
This commit therefore computes the rcu_data pointer from the rcu_state
pointer within __rcu_process_callbacks() so that callers can pass in
only the pointer to the rcu_state structure. This paves the way for
linking the rcu_state structures together and iterating over them.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit adds an rcubarrier file to RCU's debugfs statistical tracing
directory, providing diagnostic information on rcu_barrier().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit adds event tracing for _rcu_barrier() execution. This
is defined only if RCU_TRACE=y.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The traditional rcu_barrier() implementation has serialized all requests,
regardless of RCU flavor, and also does not coalesce concurrent requests.
In the past, this has been good and sufficient.
However, systems are getting larger and use of rcu_barrier() has been
increasing. This commit therefore introduces a counter-based scheme
that allows _rcu_barrier() calls for the same flavor of RCU to take
advantage of each others' work.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
For global variables, C defaults all fields to zero. The initialization
of the rcu_state structure's ->n_force_qs and ->n_force_qs_ngp fields
is therefore redundant, so this commit removes these initializations.
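The language rule relied on here can be shown with a small sketch. The structure below is a hypothetical stand-in, not the real rcu_state; the point is only that members omitted from the initializer of an object with static storage duration start out as zero.

```c
/* Hypothetical stand-in carrying the two counters named above. */
struct sketch_rcu_state {
	int levels;
	unsigned long n_force_qs;
	unsigned long n_force_qs_ngp;
};

/*
 * Only .levels is initialized explicitly. Because the object has static
 * storage duration, C zero-initializes the remaining members.
 */
static struct sketch_rcu_state sketch_state = {
	.levels = 1,
};

/* Accessor used to observe the implicitly zeroed counter. */
static unsigned long sketch_n_force_qs(void)
{
	return sketch_state.n_force_qs;
}
```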
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In order to allow each RCU flavor to concurrently execute its
rcu_barrier() function, it is necessary to move the relevant
state to the rcu_state structure. This commit therefore moves the
rcu_barrier_mutex global variable to a new ->barrier_mutex field
in the rcu_state structure.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In order to allow each RCU flavor to concurrently execute its
rcu_barrier() function, it is necessary to move the relevant
state to the rcu_state structure. This commit therefore moves the
rcu_barrier_completion global variable to a new ->barrier_completion
field in the rcu_state structure.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In order to allow each RCU flavor to concurrently execute its rcu_barrier()
function, it is necessary to move the relevant state to the rcu_state
structure. This commit therefore moves the rcu_barrier_cpu_count global
variable to a new ->barrier_cpu_count field in the rcu_state structure.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In order for multiple flavors of RCU to each concurrently run one
rcu_barrier(), each flavor needs its own per-CPU set of rcu_head
structures. This commit therefore moves _rcu_barrier()'s set of
per-CPU rcu_head structures from per-CPU variables to the existing
per-CPU and per-RCU-flavor rcu_data structures.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This is a preparatory commit for increasing rcu_barrier()'s concurrency.
It adds a pointer in the rcu_data structure to the corresponding call_rcu()
function. This allows a pointer to the rcu_data structure to imply the
function pointer, which allows _rcu_barrier() state to be placed in the
rcu_state structure.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Upcoming rcu_barrier() concurrency commits will result in line lengths
greater than 80 characters in the RCU_STATE_INITIALIZER(), so this commit
shortens the name of the macro's argument to prevent this.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu_node tree array is sized based on compile-time constants,
including NR_CPUS. Although this approach has worked well in the past,
the recent trend by many distros to define NR_CPUS=4096 results in
excessive grace-period-initialization latencies.
This commit therefore substitutes the run-time computed nr_cpu_ids for
the compile-time NR_CPUS when building the tree. This can result in
much of the compile-time-allocated rcu_node array being unused. If
this is a major problem, you are in a specialized situation anyway,
so you can manually adjust the NR_CPUS, RCU_FANOUT, and RCU_FANOUT_LEAF
kernel config parameters.
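To give a feel for the sizes involved, the following userspace sketch counts the nodes of a tree built over a given number of CPUs. The helper names are hypothetical, and the fanout values used in the note below (leaf fanout 16, interior fanout 64) are assumptions chosen for illustration rather than values taken from this commit.

```c
/* Integer division, rounding up. */
static unsigned int sketch_div_round_up(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/*
 * Count the nodes of a tree covering ncpus CPUs (ncpus >= 1), where each
 * leaf covers up to leaf_fanout CPUs and each interior node covers up to
 * fanout children (fanout >= 2).
 */
static unsigned int sketch_rcu_node_count(unsigned int ncpus,
					  unsigned int leaf_fanout,
					  unsigned int fanout)
{
	unsigned int nodes = sketch_div_round_up(ncpus, leaf_fanout);
	unsigned int total = nodes;

	while (nodes > 1) {
		nodes = sketch_div_round_up(nodes, fanout);
		total += nodes;
	}
	return total;
}
```

With the assumed fanouts, a tree sized for 4096 CPUs has 261 nodes in three levels, while a tree sized for 16 CPUs needs a single node. That gap is what a kernel built with NR_CPUS=4096 would otherwise initialize on every grace period when booted on a small machine.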
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Time to make the four-level-hierarchy setting less scary, so this
commit removes "Experimental" from the boot-time message. Leave the
message in order to get a heads-up on any possible need to expand to
a five-level hierarchy.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although making RCU_FANOUT_LEAF a kernel configuration parameter rather
than a fixed constant makes it easier for people to decrease cache-miss
overhead for large systems, it is of little help for people who must
run a single pre-built kernel binary.
This commit therefore allows the value of RCU_FANOUT_LEAF to be
increased (but not decreased!) via a boot-time parameter named
rcutree.rcu_fanout_leaf.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This reverts commit 616c310e83.
(Move PREEMPT_RCU preemption to switch_to() invocation).
Testing by Sasha Levin <levinsasha928@gmail.com> showed that this
can result in deadlock due to invoking the scheduler when one of
the runqueue locks is held. Because this commit was simply a
performance optimization, revert it.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
With the introduction of suspend-to-both in the in-kernel hibernation
code, dmesg was getting polluted with backspace characters printed as
part of the image-saving progress indicator. This patch prints the
progress indicator on image save/load every 10%, one line at a
time. As an additional benefit, all other messages emitted by the kernel
during hibernation/thaw should now print cleanly as well.
Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Change the behavior of the newly introduced
/sys/power/pm_print_times attribute so that its initial value
depends on initcall_debug, but setting it to 0 will cause device
suspend/resume times not to be printed, even if initcall_debug has
been set. This way, the people who use initcall_debug for reasons
other than PM debugging will be able to switch the suspend/resume
times printing off, if need be.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Added a new knob called /sys/power/pm_print_times. Setting it to 1
enables printing of time taken by devices to suspend and resume.
Setting it to 0 disables this printing (unless overridden by
initcall_debug kernel command line option).
Signed-off-by: Sameer Nanda <snanda@chromium.org>
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If function tracing is enabled for some of the low-level suspend/resume
functions, it leads to a triple fault during resume from suspend, ultimately
ending up in a reboot instead of a resume (or a total refusal to come out
of suspended state, on some machines).
This issue was explained in more detail in commit f42ac38c59 (ftrace:
disable tracing for suspend to ram). However, the changes made by that commit
got reverted by commit cbe2f5a6e8 (tracing: allow tracing of
suspend/resume & hibernation code again). So, unfortunately since things are
not yet robust enough to allow tracing of low-level suspend/resume functions,
suspend/resume is still broken when ftrace is enabled.
So fix this by disabling function tracing during suspend/resume & hibernation.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
It is often useful to suspend to memory after the hibernation image has
been written to disk. If the battery runs out or power is otherwise lost,
the computer will resume from the hibernated image. If not, it will resume
from memory and the hibernation image will be discarded.
Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Fix kernel-doc warnings in printk.c: use correct parameter name.
Warning(kernel/printk.c:2429): No description found for parameter 'buf'
Warning(kernel/printk.c:2429): Excess function parameter 'line' description in 'kmsg_dump_get_buffer'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'driver-core-3.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg Kroah-Hartman:
"Here is a number of printk() fixes, specifically a few reported by the
crazy blog program that ships in SUSE releases (that's "boot log" and
not "web log", it predates the general "blog" terminology by many
years), and the restoration of the continuation line functionality
reported by Stephen and others. Yes, the changes seem a bit big this
late in the cycle, but I've been beating on them for a while now, and
Stephen has even optimized it a bit, so all looks good to me.
The other change in here is a Documentation update for the stable
kernel rules describing how some distro patches should be backported,
to hopefully drive a bit more response from the distros to the stable
kernel releases.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
printk: Optimize if statement logic where newline exists
printk: flush continuation lines immediately to console
syslog: fill buffer with more than a single message for SYSLOG_ACTION_READ
Revert "printk: return -EINVAL if the message len is bigger than the buf size"
printk: fix regression in SYSLOG_ACTION_CLEAR
stable: Allow merging of backports for serious user-visible performance issues
This patch adds the following structure:
	struct netlink_kernel_cfg {
		unsigned int	groups;
		void		(*input)(struct sk_buff *skb);
		struct mutex	*cb_mutex;
	};
That can be passed to netlink_kernel_create to set optional configurations
for netlink kernel sockets.
I've populated this structure by looking for parameters that are passed
as NULL or zero in the existing code. The parameters that always need to
be set remain in the original interface.
The structure thus holds the optional parameters for netlink socket
creation, which allows easy extensibility of this interface in the future.
This patch also adapts all callers to use this new interface.
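How a caller might fill in the structure can be sketched in userspace as follows. The structure definition is the one quoted above; everything else is a hypothetical mock (opaque sk_buff and mutex types, sketch_* helpers), and the call to netlink_kernel_create() itself is deliberately left out because its new signature is not quoted in this message.

```c
#include <stddef.h>

/* Opaque stand-ins for the kernel types used by the structure. */
struct sk_buff;
struct mutex;

struct netlink_kernel_cfg {
	unsigned int	groups;
	void		(*input)(struct sk_buff *skb);
	struct mutex	*cb_mutex;
};

/* Hypothetical receive callback for the sketch. */
static void sketch_input(struct sk_buff *skb)
{
	(void)skb;
}

/* Set only what the caller needs; omitted members default to zero/NULL. */
static struct netlink_kernel_cfg sketch_make_cfg(void)
{
	struct netlink_kernel_cfg cfg = {
		.input = sketch_input,
	};

	return cfg;
}

/* True if the optional members were left at their defaults. */
static int sketch_cfg_uses_defaults(const struct netlink_kernel_cfg *cfg)
{
	return cfg->groups == 0 && cfg->cb_mutex == NULL;
}
```

Designated initializers keep callers short, and a member added to the structure later would default to zero for every existing caller.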
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In reviewing Kay's fix up patch: "printk: Have printk() never buffer its
data", I found two if statements that could be combined and optimized.
Put together the two 'cont.len && cont.owner == current' if statements
into a single one, and check if we need to call cont_add(). This also
removes the unneeded double cont_flush() calls.
Link: http://lkml.kernel.org/r/1340869133.876.10.camel@mop
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When removing pages from the ring buffer, its state is not reset. This
means that the counters need to be correctly updated to account for the
pages removed.
Update the overrun counter to reflect the removed events from the pages.
Link: http://lkml.kernel.org/r/1340998301-1715-1-git-send-email-vnagarnaik@google.com
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new_pages list head in the cpu_buffer is not initialized. When
adding pages to the ring buffer, if the memory allocation fails in
ring_buffer_resize, the clean up handler tries to free up the allocated
pages from all the cpu buffers. The panic is caused by referencing the
uninitialized new_pages list head.
Initializing the new_pages list head in rb_allocate_cpu_buffer fixes
this.
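Why the initialization matters can be seen in a minimal userspace sketch of a circular doubly linked list. The sketch_* names are hypothetical and only imitate the style of the kernel's list helpers: a cleanup loop walks from head->next until it returns to the head, so a head that was never initialized sends the walk through garbage pointers.

```c
/* Hypothetical stand-in for the kernel's struct list_head. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

/* An empty list is a head that points to itself in both directions. */
static void sketch_init_list_head(struct sketch_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Append entry just before head, that is, at the tail of the list. */
static void sketch_list_add_tail(struct sketch_list_head *entry,
				 struct sketch_list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Walk the list the way a cleanup handler would; safe only after init. */
static int sketch_list_count(const struct sketch_list_head *head)
{
	const struct sketch_list_head *pos;
	int n = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}
```

With the head initialized, the walk over a list that never received any pages terminates immediately instead of dereferencing an undefined pointer.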
Link: http://lkml.kernel.org/r/1340391005-10880-1-git-send-email-vnagarnaik@google.com
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Continuation lines are buffered internally, intended to merge the
chunked printk()s into a single record, and to isolate potentially
racy continuation users from usual terminated line users.
This, though, has the effect that partial lines are not printed to
the console at the moment they are emitted. If the kernel crashes
in the meantime, the potentially interesting printed information
would never reach the consoles.
Here we share the continuation buffer with the console copy logic,
and partial lines are always immediately flushed to the available
consoles. They are still buffered internally to improve the
readability and integrity of the messages and minimize the amount
of needed record headers to store.
Signed-off-by: Kay Sievers <kay@vrfy.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflicts:
drivers/net/caif/caif_hsi.c
drivers/net/usb/qmi_wwan.c
The qmi_wwan merge was trivial.
The caif_hsi.c, on the other hand, was not. It's a conflict between
1c385f1fdf ("caif-hsi: Replace platform
device with ops structure.") in the net-next tree and commit
39abbaef19 ("caif-hsi: Postpone init of
HIS until open()") in the net tree.
I did my best with that one and will ask Sjur to check it out.
Signed-off-by: David S. Miller <davem@davemloft.net>
The ring buffer reader page is used to swap a page from the writable
ring buffer. If the writer happens to be on that page, it ends up on the
reader page, but will simply move off of it, back into the writable ring
buffer as writes are added.
The time stamp passed back to the readers is stored in the cpu_buffer per
CPU descriptor. This stamp is updated when a swap of the reader page takes
place, and it reads the current stamp from the page taken from the writable
ring buffer. Every time a writer goes to a new page, it updates the time stamp
of that page.
The problem happens if a reader reads a page from an empty per CPU ring buffer.
If the buffer is empty, the swap still takes place, placing the writer at the
start of the reader page. If at a later time, a write happens, it updates the
page's time stamp and continues. But the problem is that the read_stamp does
not get updated, because the page was already swapped.
The solution is to not swap the page if the ring buffer happens to be
empty. This also removes a side effect: data written on the reader page
is never overwritten, because the writer never gets back onto the reader
page without a swap. Consider a read on an empty buffer that is followed by
no further reads for a while. If a swap took place and the writer then started
writing a lot of data (function tracer), it would overflow the ring buffer
and overwrite the older data. But because the writer never goes back onto the
reader page, the data left on the reader page would never get overwritten. This
would cause the reader to see really old data, followed by a jump to newer data.
Link: http://lkml.kernel.org/r/1340060577-9112-1-git-send-email-dhsharp@google.com
Google-Bug-Id: 6410455
Reported-by: David Sharp <dhsharp@google.com>
Tested-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Replace the NR_CPUS array of buffer_iter from the trace_iterator
with an allocated array. This will just create an array of
possible CPUS instead of the max number specified.
The use of NR_CPUS in that array caused allocation failures for
machines that were tight on memory. This did not cause any failures
to the system itself (no crashes), but caused unnecessary failures
for reading the trace files.
Added a helper function called 'trace_buffer_iter()' that returns
the buffer_iter item or NULL if it is not defined or the array was
not allocated. Some routines do not require the array
(tracing_open_pipe() for one).
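The shape of such a helper can be sketched in userspace as follows. The types and the sketch_ prefix are hypothetical mocks, not the tracing structures themselves; the sketch only shows the two NULL checks that let callers cope with a missing array or a missing per-CPU entry.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-CPU ring buffer iterator. */
struct sketch_ring_buffer_iter {
	int cpu;
};

/* Hypothetical stand-in for trace_iterator with an allocated array. */
struct sketch_trace_iterator {
	struct sketch_ring_buffer_iter **buffer_iter;	/* may be NULL */
};

/* Return the iterator for cpu, or NULL if the array or the entry is absent. */
static struct sketch_ring_buffer_iter *
sketch_trace_buffer_iter(struct sketch_trace_iterator *iter, int cpu)
{
	if (iter->buffer_iter && iter->buffer_iter[cpu])
		return iter->buffer_iter[cpu];
	return NULL;
}
```

Routines that never allocate the array can then share code paths with those that do, because every lookup goes through the same checked accessor.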
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a WARN_ON() output on test failures so that they are easier to detect
in automated tests. The WARN_ON() will not print, of course, if the test
causes the system to crash.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The recent changes to the printk buffer management resulted in
SYSLOG_ACTION_READ returning only a single message, whereas previously
the buffer would get filled as much as possible. Because filling the
buffer to the last byte when it is too small to fit everything would be
pretty ugly with the new code, the patch arranges for as many messages
as possible to be returned in a single invocation. User space tools in
at least all SLES versions depend on the old behavior.
This also addresses the issue that commit b56a39ac26 ("printk: return
-EINVAL if the message len is bigger than the buf size") attempted to
fix. Since that commit widened the possibility of losing a message
altogether, the patch here assumes that it gets reverted first
(otherwise the patch here won't apply).
Furthermore, this patch also addresses the problem dealt with in
commit 4a77a5a06e ("printk: use mutex
lock to stop syslog_seq from going wild"), so I'd recommend reverting
that one too (albeit there's no direct collision between the two).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kay Sievers <kay@vrfy.org>
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit b56a39ac26.
A better patch from Jan will follow this to resolve the issue.
Acked-by: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <wfg@linux.intel.com>
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit b1420f1c (Make rcu_barrier() less disruptive) rearranged the
code in rcu_do_batch(), moving the ->qlen manipulation to follow
the requeueing of the callbacks. Unfortunately, this rearrangement
clobbered the value of the "count" local variable before the value
of rdp->qlen was adjusted, resulting in the value of rdp->qlen being
inaccurate. This commit therefore introduces an index variable "i",
avoiding the inadvertent multiplexing.
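The class of bug can be reduced to the following userspace sketch. It is a simplification under assumptions, not the rcu_do_batch() code: the function names and parameters are hypothetical, and the empty loop stands in for the requeueing of leftover callbacks.

```c
/*
 * Buggy shape: "count" holds the number of callbacks invoked, but is then
 * reused as the loop index, so the later adjustment uses the wrong value.
 */
static long sketch_adjust_qlen_buggy(long qlen, long invoked, int nsegs)
{
	long count = invoked;

	for (count = 0; count < nsegs; count++)
		;	/* requeue leftover callbacks for segment "count" */
	return qlen - count;	/* subtracts nsegs, not invoked */
}

/* Fixed shape: a separate index "i" leaves "count" intact. */
static long sketch_adjust_qlen_fixed(long qlen, long invoked, int nsegs)
{
	long count = invoked;
	int i;

	for (i = 0; i < nsegs; i++)
		;	/* requeue leftover callbacks for segment "i" */
	return qlen - count;
}
```

For a queue of 10 callbacks of which 4 were invoked, with 3 segments to requeue, the fixed version reports 6 remaining while the buggy version reports 7.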
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Commit 7ff9554bb5 (printk: convert
byte-buffer to variable-length record buffer) introduced a regression
by accidentally removing a "break" statement from inside the big
switch in printk's do_syslog(). The symptom of this bug is that the
"dmesg -C" command doesn't only clear the kernel's log buffer; it also
disables console logging.
This patch (as1561) fixes the regression by adding the missing
"break".
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
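The missing-"break" regression in do_syslog() described above can be reproduced in miniature. The sketch below is a hypothetical userspace model (the enum, structure, and function names are not the kernel's): without the "break", the clear action falls through into the next case and turns console logging off as well.

```c
/* Hypothetical actions modelled on the two cases involved. */
enum sketch_action {
	SKETCH_ACTION_CLEAR,
	SKETCH_ACTION_CONSOLE_OFF,
};

/* Hypothetical log state. */
struct sketch_log {
	int cleared;
	int console_enabled;
};

/* Regressed shape: the CLEAR case has lost its "break". */
static void sketch_do_syslog_buggy(struct sketch_log *log, enum sketch_action a)
{
	switch (a) {
	case SKETCH_ACTION_CLEAR:
		log->cleared = 1;
		/* BUG: no "break" here, so CLEAR also runs the next case. */
		/* fall through */
	case SKETCH_ACTION_CONSOLE_OFF:
		log->console_enabled = 0;
		break;
	}
}

/* Fixed shape: each case ends with its own "break". */
static void sketch_do_syslog_fixed(struct sketch_log *log, enum sketch_action a)
{
	switch (a) {
	case SKETCH_ACTION_CLEAR:
		log->cleared = 1;
		break;
	case SKETCH_ACTION_CONSOLE_OFF:
		log->console_enabled = 0;
		break;
	}
}
```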
This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%. For example one of my systems went from 12.1Mpps
to 11.6 after the early socket demux was added. It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.
By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.
[ Move local variables in ip_rcv_finish() down into the basic
block in which they are actually used. -DaveM ]
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull perf updates from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ftrace: Make all inline tags also include notrace
perf: Use css_tryget() to avoid propping up css refcount
perf tools: Fix synthesizing tracepoint names from the perf.data headers
perf stat: Fix default output file
perf tools: Fix endianity swapping for adds_features bitmask
Pull two cgroup fixes from Tejun Heo:
"This contains two patches fixing a refcnt race bug during css_put().
Decrementing and checking the value weren't atomic and two tasks could
think that they both pushed the counter to zero."
* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroups: Account for CSS_DEACT_BIAS in __css_put
cgroup: make sure that decisions in __css_put are atomic
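The race described in the pull message can be illustrated with C11 atomics in userspace. This is a generic sketch, not the cgroup code: the sketch_* functions are hypothetical and ignore details such as CSS_DEACT_BIAS. The point is that the decision must be based on the value returned by the decrement itself, not on a separate later read.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Racy shape: decrement, then read again. Two tasks can both decrement
 * (2 -> 1 -> 0) before either reads, and then both observe zero.
 */
static bool sketch_put_racy(atomic_int *refcnt)
{
	atomic_fetch_sub(refcnt, 1);
	return atomic_load(refcnt) == 0;
}

/*
 * Atomic shape: atomic_fetch_sub() returns the value before the
 * decrement, so exactly one caller sees the 1 -> 0 transition.
 */
static bool sketch_put(atomic_int *refcnt)
{
	return atomic_fetch_sub(refcnt, 1) == 1;
}
```

Run single-threaded, both variants behave the same; the difference only shows when two tasks interleave between the decrement and the read in sketch_put_racy().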
Merge tag 'driver-core-3.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and printk fixes from Greg Kroah-Hartman:
"Here are some fixes for 3.5-rc4 that resolve the kmsg problems that
people have reported showing up after the printk and kmsg changes went
into 3.5-rc1. There are also a smattering of other tiny fixes for the
extcon and hyper-v drivers that people have reported.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
extcon: max8997: Add missing kfree for info->edev in max8997_muic_remove()
extcon: Set platform drvdata in gpio_extcon_probe() and fix irq leak
extcon: Fix wrong index in max8997_extcon_cable[]
kmsg - kmsg_dump() fix CONFIG_PRINTK=n compilation
printk: return -EINVAL if the message len is bigger than the buf size
printk: use mutex lock to stop syslog_seq from going wild
kmsg - kmsg_dump() use iterator to receive log buffer content
vme: change maintainer e-mail address
Extcon: Don't try to create duplicate link names
driver core: fixup reversed deferred probe order
printk: Fix alignment of buf causing crash on ARM EABI
Tools: hv: verify origin of netlink connector message
During the merge of the PR_GET_TID_ADDRESS patch the code was misplaced (it
ended up under PR_MCE_KILL), and as a result no one can use this option.
Fix it by moving the code snippet to the proper place.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_new_reaper() changes pid_ns->child_reaper, see add0d4df ("pid_ns:
zap_pid_ns_processes: fix the ->child_reaper changing").
The original reason has gone away after the previous patch, ->children
list must be empty after zap_pid_ns_processes().
However now we can not switch to init_pid_ns.child_reaper.
__unhash_process() relies on the "->child_reaper == parent" check, but
this check does not work if the last exiting task is also the child
reaper.
As Eric suggested, we can change __unhash_process() to use the parent's
pid_ns and remove this code.
Also, with this change we can move detach_pid(PIDTYPE_PID) back, where it
was before the previous fix.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Tested-by: Andrew Wagin <avagin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Today we have a twofold bug. Sometimes release_task() on pid == 1 in a pid
namespace can run before the other processes in that pid namespace have had
release_task() called, with the result that pid_ns_release_proc() can be
called before the last proc_flush_task() is done using upid->ns->proc_mnt,
resulting in the use of a stale pointer. This same set of circumstances
can lead to waitpid(...) returning for a process started with
clone(CLONE_NEWPID) before every process in the pid namespace has
actually exited.
To fix this, modify zap_pid_ns_processes() to wait until all other processes
in the pid namespace have exited, even EXIT_DEAD zombies.
The delay_group_leader and related tests ensure that the thread group
leader will be the last thread of a process group to be reaped, or to
become EXIT_DEAD and self reap. With the change to zap_pid_ns_processes()
we get the guarantee that pid == 1 in a pid namespace will be the last
task that release_task() is called on.
With pid == 1 being the last task to pass through release_task(),
pid_ns_release_proc() can no longer be called too early, nor can wait return
before all of the EXIT_DEAD tasks in a pid namespace have exited.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Tested-by: Andrew Wagin <avagin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does
put_user(clear_child_tid) which can update task->rss_stat and thus make
mm->rss_stat inconsistent. This triggers the "BUG:" printk in check_mm().
Let's fix this bug in the safest way, and optimize/cleanup this later.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we fixed the race between atomic_dec and css_refcnt, we missed
the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get
the actual reference count. This can potentially cause a refcount leak
if __css_put races with cgroup_clear_css_refs.
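As a rough userspace illustration of the two fixes (not the kernel code), the sketch below uses C11 atomics. The names DEACT_BIAS, struct ref, unbias() and ref_put() are hypothetical stand-ins for CSS_DEACT_BIAS, the css refcount and __css_put(). The decrement and the read of the new value happen in one atomic operation, and the bias is removed before the value is compared with zero.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for CSS_DEACT_BIAS: a large negative offset that is
 * added to the counter while the object is being deactivated. */
#define DEACT_BIAS INT_MIN

struct ref {
    atomic_int count;   /* real count, or real count + DEACT_BIAS */
};

/* Strip the bias so that callers always see the real reference count. */
static int unbias(int v)
{
    return v >= 0 ? v : v - DEACT_BIAS;
}

/*
 * Drop one reference. The decrement and the read of the resulting value are
 * a single atomic operation, so exactly one caller can observe zero.
 * Returns true if the caller dropped the last reference.
 */
static bool ref_put(struct ref *r)
{
    int v = atomic_fetch_sub(&r->count, 1) - 1;

    return unbias(v) == 0;
}
```

Comparing the raw value with zero instead of unbias(v) would miss the final put on a biased counter, which is the kind of leak described above.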
Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Originally from Peter Zijlstra. The helper migrates perf events
from one cpu to another cpu.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-5-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow the pmu->event_init callback to change event->cpu, so the PMU driver
can choose the CPU on which to install events.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-4-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_event_open() requires the cpu on which the event is installed to be
online, but the cpu can go offline after perf_event_open() checks that. Add a
get_online_cpus()/put_online_cpus() pair to avoid the race.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-3-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An rmdir pushes css's ref count to zero. However, if the associated
directory is open at the time, the dentry ref count is non-zero. If
the fd for this directory is then passed into perf_event_open, it
does a css_get(). This bounces the ref count back up from zero. This
is a problem by itself. But what makes it turn into a crash is the
fact that we end up doing an extra dput, since we perform a dput
when css_put sees the ref count go down to zero.
css_tryget() does not fall into that trap. So, we use that instead.
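The difference between the two calls can be sketched in userspace with C11 atomics. This is only an illustration under assumed names (ref_get() and ref_tryget() are hypothetical, not the kernel helpers): an unconditional get revives a dead counter, while a tryget refuses once the count has reached zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Unconditional get: happily bumps the counter even if it already hit zero. */
static void ref_get(atomic_int *count)
{
    atomic_fetch_add(count, 1);
}

/*
 * Conditional get: take a reference only while the object is still alive.
 * On a failed compare-exchange, v is reloaded with the current value and
 * the liveness check is repeated.
 */
static bool ref_tryget(atomic_int *count)
{
    int v = atomic_load(count);

    while (v > 0) {
        if (atomic_compare_exchange_weak(count, &v, v + 1))
            return true;
    }
    return false;
}
```

A caller that gets false from ref_tryget() must treat the object as gone and bail out, instead of later doing a put that nobody is expecting.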
Reproduction test-case for the bug:
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/unistd.h>
#include <linux/perf_event.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
#define PERF_FLAG_PID_CGROUP (1U << 2)
int perf_event_open(struct perf_event_attr *hw_event_uptr,
                    pid_t pid, int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
                   group_fd, flags);
}
/*
 * Directly poke at the perf_event bug, since it's proving hard to repro
 * depending on where in the kernel tree. what moved?
 */
int main(int argc, char **argv)
{
    int fd;
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.exclude_kernel = 1;
    attr.size = sizeof(attr);
    mkdir("/dev/cgroup/perf_event/blah", 0777);
    fd = open("/dev/cgroup/perf_event/blah", O_RDONLY);
    perror("open");
    rmdir("/dev/cgroup/perf_event/blah");
    sleep(2);
    perf_event_open(&attr, fd, 0, -1, PERF_FLAG_PID_CGROUP);
    perror("perf_event_open");
    close(fd);
    return 0;
}
Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There isn't a really compelling reason to force ->map to be populated,
so allow it to be left unset.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
Just like devkmsg_read() does, return -EINVAL if the message length is
bigger than the buffer size; otherwise it will trigger a segfault.
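The shape of the check can be sketched as follows. This is a userspace illustration only; copy_record() and its parameters are hypothetical and do not correspond to any kernel function.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy one message into a caller-supplied buffer.
 * Returns the number of bytes copied, or -EINVAL if the message does not
 * fit, instead of writing past the end of the buffer.
 */
static int copy_record(char *buf, size_t buf_size,
                       const char *msg, size_t msg_len)
{
    if (msg_len > buf_size)
        return -EINVAL;

    memcpy(buf, msg, msg_len);
    return (int)msg_len;
}
```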
Acked-by: Kay Sievers <kay@vrfy.org>
Acked-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Although syslog_seq and log_next_seq are protected by the logbuf_lock
spin lock, that is not enough. Say we have two processes A and B, with
syslog_seq = N and log_next_seq = N + 1, and both processes
enter syslog_print() at almost the same time. No matter which
process gets the spin lock first, it increases syslog_seq by one and
then releases the spin lock; later, the other process increases syslog_seq
by one again. In this case, syslog_seq is bigger than log_next_seq.
After that,
wait_event_interruptible(log_wait, syslog_seq != log_next_seq)
no longer waits even when no new message has been written, which leads to
an infinite read loop.
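The interleaving can be replayed deterministically in userspace. The sketch below is an assumption-laden model, not kernel code: struct log_state, reader_may_proceed() and reader_consume() are hypothetical names for the two sequence numbers, the wait condition and the per-reader increment.

```c
#include <stdbool.h>

/* Minimal model of the two sequence numbers involved in the race. */
struct log_state {
    unsigned long syslog_seq;    /* next record the reader will consume */
    unsigned long log_next_seq;  /* next record the writer will produce */
};

/* Model of the wait condition: the reader proceeds when this is true. */
static bool reader_may_proceed(const struct log_state *s)
{
    return s->syslog_seq != s->log_next_seq;
}

/* Model of one reader consuming a record under the spin lock. */
static void reader_consume(struct log_state *s)
{
    s->syslog_seq++;
}
```

If both readers evaluate reader_may_proceed() with syslog_seq = N and log_next_seq = N + 1 and then each calls reader_consume(), syslog_seq ends up at N + 2 and the condition stays true although nothing new was written. Holding one mutex across the check and the consume step prevents the second reader from acting on a stale check.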
I can easily see this kind of issue by the following steps:
# cat /proc/kmsg # at meantime, I don't kill rsyslog
# So they are the two processes.
# xinit # I added drm.debug=6 in the kernel parameter line,
# so that it will produce lots of message and let that
# issue happen
It's 100% reproducible on my side, and my disk gets filled up by
/var/log/messages in quite a short time.
So, introduce a mutex to stop syslog_seq from going wild, just like
what devkmsg_read() does. It does fix this issue as expected.
v2: use mutex_lock_interruptible() instead (comments from Kay)
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-By: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1. __copy_insn() needs "loff_t offset", not "unsigned long",
to read the file.
2. use pgoff_t for "idx" and remove the unnecessary typecast.
3. fix the typo, "&=" is not what we want
4. can't resist, rename off1 to off.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154359.GA9625@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
loff_t looks confusing when it is used for the virtual address.
Change map_info and install_breakpoint/remove_breakpoint paths
to use "unsigned long".
The patch doesn't change vma_address(), it can't return "long"
because it is used to verify the mapping. But probably this
needs some cleanups too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154355.GA9622@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
uprobe->pending_list is only used to create the temporary list,
it has no meaning after we drop uprobes_mmap_hash(inode).
No need to initialize this node or remove it from tmp_list, and
we can use list_for_each_entry().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120615154353.GA9614@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
write_opcode() ensures that UPROBE_SWBP_INSN doesn't cross the
page boundary. This looks a bit confusing, the check does not
depend on vaddr and it is enough to do it only once right after
install_breakpoint()->arch_uprobe_analyze_insn().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154350.GA9611@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
write_opcode() is called by register_for_each_vma() and
uprobe_mmap() paths. In both cases the caller has already
verified this vaddr under mmap_sem, no need to re-check.
Note also that this check is wrong anyway, we should not
truncate loff_t returned by vma_address() if we do not trust
this mapping.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154347.GA9604@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
copy_insn() returns -ENOMEM if the first __copy_insn() fails,
it should return the correct error code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154344.GA9601@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. copy_insn() doesn't need "addr", it can use uprobe->offset.
Remove this argument.
2. Change copy_insn/__copy_insn to accept "struct file*" instead
of vma.
copy_insn() is called only once and mm/vma/vaddr are random, it
shouldn't depend on them.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154342.GA9598@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because the mind is treacherous and makes us forget we need to
write stuff down.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120615154339.GA9591@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
build_map_info() doesn't allocate the memory under i_mmap_mutex
to avoid the deadlock with page reclaim. But it can try
GFP_NOWAIT first, it should work in the likely case and thus we
almost never need the pre-alloc-and-retry path.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Link: http://lkml.kernel.org/r/20120615154336.GA9588@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently register_for_each_vma() is O(n ** 2) + O(n ** 3),
every time find_next_vma_info() "restarts" the
vma_prio_tree_foreach() loop and each iteration rechecks the
whole try_list. This also means that try_list can grow
"indefinitely" if register/unregister races with munmap/mmap
activity even if the number of mappings is bounded at any time.
With this patch register_for_each_vma() builds the list of
mm/vaddr structures only once and does install_breakpoint() for
each entry.
We do not care about the new mappings which can be created after
build_map_info() drops mapping->i_mmap_mutex, uprobe_mmap()
should do its work.
Note that we do not allocate map_info under i_mmap_mutex, this
can deadlock with page reclaim (but see the next patch). So we
use 2 lists, "curr" which we are going to return, and "prev"
which holds the already allocated memory. The main loop dequeues
the entry from "prev" (initially it is empty), and if "prev"
becomes empty again it counts the number of entries we need to
pre-allocate outside of i_mmap_mutex.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Link: http://lkml.kernel.org/r/20120615154333.GA9581@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
install_breakpoint() returns -EEXIST if is_swbp_insn(orig_insn)
== T, the caller treats this code as success.
This is doubly wrong. The successful return should set
UPROBE_COPY_INSN, but the real problem is that it shouldn't
succeed. If the probed insn is int3 the application should get
SIGTRAP, this won't happen with uprobe.
Probably we can fix this, we can add the UPROBE_SHARED_BP flag
and teach handle_swbp/set_orig_insn to handle this case
correctly. But this needs some complications and we have other
insns which can't be probed, let's make a simple fix for now.
I think this needs a cleanup. UPROBE_COPY_INSN should die,
copy_insn() should be called by alloc_uprobe().
arch_uprobe_analyze_insn() depends on ->mm (ia32_compat) but it
is called only once.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154331.GA9578@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
write_opcode() gets old_page via get_user_pages() and then calls
__replace_page() which assumes that this old_page is still
mapped after pte_offset_map_lock().
This is not true if this old_page was already try_to_unmap()'ed,
and in this case everything __replace_page() does with old_page
is wrong. Just for example, put_page() is not balanced.
I think it is possible to teach __replace_page() to handle this
unlikely case correctly, but this patch simply changes it to use
page_check_address() and return -EAGAIN if it fails. The caller
should notice this error code and retry.
Note: write_opcode() asks for the cleanups, I'll try to do this
in a separate patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154328.GA9571@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__copy_insn() blindly calls read_mapping_page(), this will crash
the kernel if ->readpage == NULL, add the necessary check. For
example, hugetlbfs_aops->readpage is NULL. Perhaps we should
change read_mapping_page() instead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154325.GA9568@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__replace_page() obviously can't work with the hugetlbfs
mappings, uprobe_register() will likely crash the kernel. Change
valid_vma() to check VM_HUGETLB as well.
As for PageTransHuge() no need to worry, vma->vm_file != NULL.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154322.GA9561@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf fixes from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
watchdog: Quiet down the boot messages
perf/x86: Fix broken LBR fixup code
tracing: Have tracing_off() actually turn tracing off
Pull core updates (RCU and locking) from Ingo Molnar:
"Most of the diffstat comes from the RCU slow boot regression fixes,
but there are also debuggability improvements/fixes."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
memblock: Document memblock_is_region_{memory,reserved}()
rcu: Precompute RCU_FAST_NO_HZ timer offsets
rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
rcu: RCU_FAST_NO_HZ detection of callback adoption
spinlock: Indicate that a lockup is only suspected
kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()
panic: Make panic_on_oops configurable
Provide an iterator to receive the log buffer content, and convert all
kmsg_dump() users to it.
The structured data in the kmsg buffer now contains binary data, which
should no longer be copied verbatim to the kmsg_dump() users.
The iterator should provide reliable access to the buffer data, and also
supports proper log line-aware chunking of data while iterating.
Signed-off-by: Kay Sievers <kay@vrfy.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Reported-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Tested-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Where irq_domain_associate() is called in irq_create_mapping, there is
no need to test for IRQ_DOMAIN_MAP_LEGACY because it is already tested
for earlier in the routine.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
While common irqdesc allocation is node aware, the irqdomain code is not.
Presently we observe a number of regressions/inconsistencies on
NUMA-capable platforms:
- Platforms using irqdomains with legacy mappings, where the
irq_descs are allocated node-local and the irqdomain data
structure is not.
- Drivers implementing irqdomains will lose node locality
regardless of the underlying struct device's node id.
This plugs in NUMA node id proliferation across the various allocation
callsites by way of of_node_to_nid() node lookup. While of_node_to_nid()
does the right thing for OF-capable platforms it doesn't presently handle
the non-DT case. This is trivially dealt with by simply wrapping in to
numa_node_id() unconditionally.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
The pattern (np ? np->full_name : "<none>") is rather common in the
kernel, but can also make for quite long lines. This patch adds a new
inline function, of_node_full_name() so that the test for a valid node
pointer doesn't need to be open coded at all call sites.
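A minimal sketch of such a helper, compilable outside the kernel, could look like this. The struct device_node here is a reduced stand-in with only the full_name member, which is an assumption made for the sake of a self-contained example.

```c
#include <stddef.h>

/* Reduced stand-in for the kernel's struct device_node. */
struct device_node {
    const char *full_name;
};

/* Return the node's full name, or "<none>" when there is no node. */
static inline const char *of_node_full_name(const struct device_node *np)
{
    return np ? np->full_name : "<none>";
}
```

A call site then shrinks from printing (np ? np->full_name : "<none>") to printing of_node_full_name(np).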
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
All trace events including ftrace internal events (like trace_printk
and function tracing), register functions that describe how to print
their output. The events may be recorded as soon as the ring buffer
is allocated, but they are just raw binary in the buffer. The mapping
of event ids to how to print them is held within a structure that
is registered on system boot.
If a crash happens in boot up before these functions are registered
then their output (via ftrace_dump_on_oops) will be useless:
Dumping ftrace buffer:
---------------------------------
<...>-1 0.... 319705us : Unknown type 6
---------------------------------
This can be quite frustrating for a kernel developer trying to see
what is going wrong.
There's no reason to register them so late in the boot up process.
They can be registered by early_initcall().
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
register_ftrace_function() checks ftrace_disabled and calls
__register_ftrace_function which does it again.
Drop the first check and add the unlikely hint to the second one. Also,
drop the label as John correctly notices.
No functional change.
Link: http://lkml.kernel.org/r/20120329171140.GE6409@aftab
Cc: Borislav Petkov <bp@amd64.org>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A bunch of bugzillas have complained how noisy the nmi_watchdog
is during boot-up especially with its expected failure cases
(like virt and bios resource contention).
This is my attempt to quiet them down and keep it less confusing
for the end user. What I did is print the message for cpu0 and
save it for future comparisons. If future cpus have an
identical message as cpu0, then don't print the redundant info.
However, if a future cpu has a different message, happily print
that loudly.
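The idea can be sketched in userspace as follows. This is an illustration only: should_print() and the string comparison are assumptions made for the example, and the patch itself may compare something other than message text.

```c
#include <stdbool.h>
#include <string.h>

/* Message remembered from cpu0, used as the reference for later cpus. */
static char cpu0_msg[128];

/*
 * Decide whether a per-cpu status message should be printed.
 * cpu0 always prints and records its message; any other cpu prints only
 * when its message differs from the one recorded for cpu0.
 */
static bool should_print(int cpu, const char *msg)
{
    if (cpu == 0) {
        strncpy(cpu0_msg, msg, sizeof(cpu0_msg) - 1);
        cpu0_msg[sizeof(cpu0_msg) - 1] = '\0';
        return true;
    }
    return strcmp(cpu0_msg, msg) != 0;
}
```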
Before the change, you would see something like:
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping 0a
Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
... version: 2
... bit width: 40
... generic registers: 2
... value mask: 000000ffffffffff
... max period: 000000007fffffff
... fixed-purpose events: 3
... event mask: 0000000700000003
NMI watchdog enabled, takes one hw-pmu counter.
Booting Node 0, Processors #1
NMI watchdog enabled, takes one hw-pmu counter.
#2
NMI watchdog enabled, takes one hw-pmu counter.
#3 Ok.
NMI watchdog enabled, takes one hw-pmu counter.
Brought up 4 CPUs
Total of 4 processors activated (22607.24 BogoMIPS).
After the change, it is simplified to:
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping 0a
Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
... version: 2
... bit width: 40
... generic registers: 2
... value mask: 000000ffffffffff
... max period: 000000007fffffff
... fixed-purpose events: 3
... event mask: 0000000700000003
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
Booting Node 0, Processors #1#2#3 Ok.
Brought up 4 CPUs
V2: little changes based on Joe Perches' feedback
V3: printk cleanup based on Ingo's feedback; checkpatch fix
V4: keep printk as one long line
V5: Ingo fix ups
Reported-and-tested-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: nzimmer@sgi.com
Cc: joe@perches.com
Link: http://lkml.kernel.org/r/1339594548-17227-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a resource has no parent, allow its start/end to be set arbitrarily
as long as any children are still contained within the new range.
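The containment rule can be sketched like this. The sketch models only a parentless resource; struct res and adjust_parentless() are hypothetical names, and the -EBUSY return value is an assumption chosen for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Reduced model of a resource tree node (no parent pointer modelled). */
struct res {
    unsigned long start, end;   /* inclusive range */
    struct res *child;          /* first child */
    struct res *sibling;        /* next child of the same parent */
};

/*
 * Move/resize a parentless resource to [start, start + size - 1].
 * Fails with -EBUSY, leaving the resource untouched, if any child would
 * fall outside the new range.
 */
static int adjust_parentless(struct res *r, unsigned long start,
                             unsigned long size)
{
    unsigned long end = start + size - 1;
    const struct res *c;

    for (c = r->child; c; c = c->sibling)
        if (c->start < start || c->end > end)
            return -EBUSY;

    r->start = start;
    r->end = end;
    return 0;
}
```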
[bhelgaas: changelog]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe().
Commit 35f3d14dbb (pipe: add support for shrinking and growing pipes)
added the capability to adjust pipe->buffers.
The problem is that some paths don't hold the pipe mutex and assume
pipe->buffers doesn't change for their duration.
Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
use it in place of pipe->buffers where appropriate.
splice_shrink_spd() loses its struct pipe_inode_info argument.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tom Herbert <therbert@google.com>
Cc: stable <stable@vger.kernel.org> # 2.6.35
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 7ff9554bb5, printk: convert
byte-buffer to variable-length record buffer, causes systems using
EABI to crash very early in the boot cycle. The first entry in struct
log is a u64, which for EABI must be 8 byte aligned.
Make use of __alignof__() to let the compiler decide the alignment, but
allow it to be overridden using CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS,
for systems which can perform unaligned access and want to save
a few bytes of space.
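A rough sketch of the approach, under stated assumptions: struct log_rec is a simplified stand-in for the record header, HAVE_EFFICIENT_UNALIGNED_ACCESS is a hypothetical compile-time switch standing in for the config option, and __alignof__ is the GCC/Clang extension named above.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the record header; the first member is a u64. */
struct log_rec {
    uint64_t ts_nsec;
    uint16_t len;
    uint16_t text_len;
    uint16_t dict_len;
    uint16_t level;
};

/* Let the compiler pick the alignment unless unaligned access is cheap. */
#ifdef HAVE_EFFICIENT_UNALIGNED_ACCESS
#define LOG_ALIGN 4
#else
#define LOG_ALIGN __alignof__(struct log_rec)
#endif

/* Size of a record with text_len bytes of text, padded to LOG_ALIGN. */
static size_t record_size(size_t text_len)
{
    size_t size = sizeof(struct log_rec) + text_len;
    size_t pad = (LOG_ALIGN - (size % LOG_ALIGN)) % LOG_ALIGN;

    return size + pad;
}
```

Because every record size is a multiple of LOG_ALIGN, a header placed directly after the previous record keeps its u64 member correctly aligned.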
Tested on Orion5x and Kirkwood.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The next idle expiry time record and idle sleeps tracking are
statistics that only concern idle.
Since we want the nohz APIs to become usable beyond the idle
context, let's pull up the handling of these statistics to the
callers in idle.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Since we want to prepare for making the nohz API work beyond
the idle case, we need to pull the ts->idle_calls incrementation up to
the callers in idle.
To perform this, we split tick_nohz_stop_sched_tick() in two parts:
a first one that checks if we can really stop the tick for idle,
and another that actually stops it. Then from the callers in idle,
we check if we can stop the tick and only then we increment idle_calls
and finally relay to the nohz API that won't care about these details
anymore.
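The split can be sketched as below. This is a simplified model, not the kernel code: the struct, the helper names and the two boolean conditions are assumptions chosen to keep the example self-contained.

```c
#include <stdbool.h>

/* Reduced model of the per-cpu tick state. */
struct tick_sched {
    int idle_calls;     /* idle statistic, owned by the idle callers */
    int tick_stopped;   /* generic nohz state */
};

/* Part 1: decide whether the tick may be stopped for idle at all. */
static bool can_stop_idle_tick(bool cpu_online, bool need_resched)
{
    return cpu_online && !need_resched;
}

/* Part 2: generic helper that stops the tick and knows nothing about idle. */
static void stop_tick(struct tick_sched *ts)
{
    ts->tick_stopped = 1;
}

/* Idle-side caller: check first, account idle_calls, then stop the tick. */
static void idle_enter(struct tick_sched *ts, bool cpu_online,
                       bool need_resched)
{
    if (can_stop_idle_tick(cpu_online, need_resched)) {
        ts->idle_calls++;
        stop_tick(ts);
    }
}
```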
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Now that the idle and nohz logic are going to be independent of each other,
ts->idle_tick becomes too biased a name to describe the field that
saves the last scheduled tick on top of which we re-calculate the next
tick to schedule when the timer is restarted.
We want to reuse this even to stop the tick outside idle cases. So let's
rename it to some more generic name: ts->last_tick.
This slightly changes the timer list stat export, so we need to increase
its version.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
When the timer tick fires, it accounts the new jiffy as either part
of system, user or idle time. This is how we record the cputime
statistics.
But when the tick is stopped from the idle task, we still need
to record the number of jiffies spent tickless until we restart
the tick and fall back to traditional tick-based cputime accounting.
To do this, we take a snapshot of jiffies when the tick is stopped
and compute the difference against the new value of jiffies when
the tick is restarted. Then we account this whole difference to
the idle cputime.
However, we are preparing to be able to stop the tick from places other
than idle. So this idle time accounting needs to be performed by
the callers of the nohz APIs, not by the nohz APIs themselves, because
we now want them to be agnostic about which places stop/restart the tick.
Therefore, we pull the tickless idle time accounting out of generic
nohz helpers up to idle entry/exit callers.
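For illustration only, the snapshot-and-difference accounting described
above can be modelled in userspace C. This is a minimal sketch, not the
kernel code: struct tick_model, idle_enter() and idle_exit() are
hypothetical names, and jiffies is a plain counter supplied by the caller.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel symbols. */
struct tick_model {
    unsigned long idle_snapshot;  /* jiffies value when the tick was stopped */
    unsigned long idle_cputime;   /* jiffies accounted as idle so far */
    int tick_stopped;
};

/* Idle entry: the caller, not the nohz helper, records the snapshot. */
static void idle_enter(struct tick_model *t, unsigned long jiffies)
{
    t->idle_snapshot = jiffies;
    t->tick_stopped = 1;
}

/* Idle exit: account the whole tickless span as idle time. */
static void idle_exit(struct tick_model *t, unsigned long jiffies)
{
    assert(t->tick_stopped);
    /* Unsigned subtraction stays correct across counter wraparound. */
    t->idle_cputime += jiffies - t->idle_snapshot;
    t->tick_stopped = 0;
}
```

Because the accounting lives in idle_enter()/idle_exit(), a tick-stop
helper shared with non-idle callers would not need to know about idle
time at all.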
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
As we plan to be able to stop the tick outside the idle task, we
need to prepare for separating nohz logic from idle. As a start,
this pulls the idle sleeping time accounting out of the tick
stop/restart API to the callers on idle entry/exit.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Merge RCU fixes from Paul E. McKenney:
" This series has four patches, the major point of which is to eliminate
some slowdowns (including boot-time slowdowns) resulting from some
RCU_FAST_NO_HZ changes. The issue with the changes is that posting timers
from the idle loop has no effect if the CPU has entered dyntick-idle
mode because the CPU has already computed its wakeup time, and posting
a timer does not cause it to be recomputed. The short-term fix is for
RCU to precompute the timeout value so that the CPU's calculation is
correct. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix lots of new kernel-doc warnings in kernel/sched/fair.c:
Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
.. more warnings
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull perf fixes from Ingo Molnar:
"A bit larger than what I'd wish for - half of it is due to hw driver
updates to Intel Ivy-Bridge which info got recently released,
cycles:pp should work there now too, amongst other things. (but we
are generally making exceptions for hardware enablement of this type.)
There are also callchain fixes in it - responding to mostly
theoretical (but valid) concerns. The tooling side sports perf.data
endianness/portability fixes which did not make it for the merge
window - and various other fixes as well."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86: Check user address explicitly in copy_from_user_nmi()
perf/x86: Check if user fp is valid
perf: Limit callchains to 127
perf/x86: Allow multiple stacks
perf/x86: Update SNB PEBS constraints
perf/x86: Enable/Add IvyBridge hardware support
perf/x86: Implement cycles:p for SNB/IVB
perf/x86: Fix Intel shared extra MSR allocation
x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
perf: Remove duplicate invocation on perf_event_for_each
perf uprobes: Remove unnecessary check before strlist__delete
perf symbols: Check for valid dso before creating map
perf evsel: Fix 32 bit values endianity swap for sample_id_all header
perf session: Handle endianity swap on sample_id_all header data
perf symbols: Handle different endians properly during symbol load
perf evlist: Pass third argument to ioctl explicitly
perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
perf tools: Make --version show kernel version instead of pull req tag
perf tools: Check if callchain is corrupted
perf callchain: Make callchain cursors TLS
...
Pull leap second timer fix from Thomas Gleixner.
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
This reverts commit 40af1bbdca.
It's horribly and utterly broken for at least the following reasons:
- calling sync_mm_rss() from mmput() is fundamentally wrong, because
there's absolutely no reason to believe that the task that does the
mmput() always does it on its own VM. Example: fork, ptrace, /proc -
you name it.
- calling it *after* having done mmdrop() on it is doubly insane, since
the mm struct may well be gone now.
- testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.
.. and those are just the three bugs I found before I decided to give up
looking for more and revert it asap. I should have caught it before I
even took it, but I trusted Andrew too much.
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm->rss_stat counters have per-task delta: task->rss_stat. Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().
do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff. Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid. So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier. After mm_release() there should
be no pagefaults.
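For illustration only, the ordering problem can be modelled in userspace
C. This is a simplified sketch under stated assumptions: struct mm_model,
struct task_model and the helper functions are hypothetical, and a single
counter stands in for the per-type rss counters.

```c
#include <assert.h>

struct mm_model   { long rss; };        /* shared counter */
struct task_model { long rss_delta; };  /* per-task pending delta */

/* Flush the per-task delta into the shared counter. */
static void sync_rss(struct mm_model *mm, struct task_model *t)
{
    mm->rss += t->rss_delta;
    t->rss_delta = 0;
}

/* A page fault charges the per-task delta, not the shared counter. */
static void fault_in_page(struct task_model *t)
{
    t->rss_delta++;
}

/* Old order: flush first, then run something that may fault. */
static long exit_old_order(struct mm_model *mm, struct task_model *t)
{
    sync_rss(mm, t);
    fault_in_page(t);      /* e.g. put_user() for clear_child_tid */
    return t->rss_delta;   /* non-zero: this delta is never flushed */
}

/* New order: let the possible fault happen, then flush. */
static long exit_new_order(struct mm_model *mm, struct task_model *t)
{
    fault_in_page(t);
    sync_rss(mm, t);
    return t->rss_delta;   /* zero: the shared counter is consistent */
}
```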
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> [3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") the stack allocated via clone() is marked in
/proc/<pid>/maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).
So to be able to restore mm->start_stack/end_stack drop vma flags test,
but still require the underlying VMA to exist.
As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zero is written at clear_tid_address when the process exits. This
functionality is used by pthread_join().
We already have sys_set_tid_address() to change this address for the
current task but there is no way to obtain it from user space.
Without the ability to find this address and dump it we can't restore
pthread'ed apps which call pthread_join() once they have been restored.
This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
the current process to obtain its own clear_tid_address.
This feature is available iff CONFIG_CHECKPOINT_RESTORE is set.
[akpm@linux-foundation.org: fix prctl numbering]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A fix for commit b32dfe3771 ("c/r: prctl: add ability to set new
mm_struct::exe_file").
After removing mm->num_exe_file_vmas, the kernel keeps mm->exe_file until
the final mmput(); it never becomes NULL while the task is alive.
We can check for other mapped files in mm instead of checking
mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in
order to forbid changing mm->exe_file a second time.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() to see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s timers
have no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
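For illustration only, the way the RCU estimate feeds into the wakeup
calculation can be sketched as taking the earlier of two deltas. The
function name idle_wakeup_delta() and its parameters are hypothetical;
the real logic lives inside tick_nohz_stop_sched_tick().

```c
#include <assert.h>

/*
 * Pick the wakeup delta (in jiffies) to program before going idle.
 * timer_wheel_delta: jiffies until the next timer-wheel event.
 * rcu_delta: RCU's estimate of when it will need the CPU again.
 */
static unsigned long idle_wakeup_delta(unsigned long timer_wheel_delta,
                                       unsigned long rcu_delta)
{
    /* Honour whichever deadline comes first. */
    return rcu_delta < timer_wheel_delta ? rcu_delta : timer_wheel_delta;
}
```

With this shape RCU no longer needs to post a timer after the wakeup
time has been computed; its deadline is already part of the computation.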
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
The RCU_FAST_NO_HZ code relies on a number of per-CPU variables.
This works, but is hidden from someone scanning the data structures
in rcutree.h. This commit therefore converts these per-CPU variables
to fields in the per-CPU rcu_dynticks structures.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
In the current code, a short dyntick-idle interval (where there is
at least one non-lazy callback on the CPU) and a long dyntick-idle
interval (where there are only lazy callbacks on the CPU) are traced
identically, which can be less than helpful. This commit therefore
emits different event traces in these two cases.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
In the present implementations of CPU hotplug, the outgoing CPU is
guaranteed to run its stop-machine process on the way out, which
will guarantee that RCU_FAST_NO_HZ forces the CPU out of dyntick-idle
mode.
However, new versions of CPU hotplug might not work this way. This
commit therefore removes this design constraint by explicitly notifying
CPUs when they adopt non-lazy RCU callbacks.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
A recent update to have tracing_on/off() only affect the ftrace ring
buffers instead of all ring buffers had a cut and paste error.
The tracing_off() did the exact same thing as tracing_on() and
would not actually turn off tracing. Unfortunately, tracing_off()
is more important to be working than tracing_on() as this is a key
development tool, as it lets the developer turn off tracing as soon
as a problem is discovered. It is also used by panic and oops code.
This bug also breaks the 'echo func:traceoff > set_ftrace_filter' command.
Cc: <stable@vger.kernel.org> # 3.4
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It was introduced for memcg to iterate cgroup hierarchy without
holding cgroup_mutex, but soon after that it was replaced with
a lockless way in memcg.
No one has used hierarchy_mutex since then, so remove it.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
__css_put is using atomic_dec on the ref count, and then
looking at the ref count to make decisions. This is prone
to races, as someone else may decrement ref count between
our decrement and our decision. Instead, we should base our
decisions on the value that we decremented the ref count to.
(This results in an actual race on Google's kernel which I
haven't been able to reproduce on the upstream kernel. Having
said that, it's still incorrect by inspection).
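For illustration only, the difference between the two patterns can be
shown with C11 atomics in userspace. This is a sketch, not the cgroup
code: put_racy() and put_fixed() are hypothetical names, and "reached
zero" stands in for whatever decision the real code makes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Racy pattern: decrement, then re-read the counter to decide. */
static bool put_racy(atomic_int *ref)
{
    atomic_fetch_sub(ref, 1);
    /* Another thread may decrement here, so this load can observe a
     * value that is not the one our own decrement produced. */
    return atomic_load(ref) == 0;
}

/* Fixed pattern: decide on the value our own decrement produced. */
static bool put_fixed(atomic_int *ref)
{
    /* atomic_fetch_sub() returns the value held before the subtraction. */
    int val = atomic_fetch_sub(ref, 1) - 1;

    assert(val >= 0);  /* a negative count would mean an unbalanced put */
    return val == 0;
}
```

Single-threaded, both variants give the same answer; they only diverge
when another decrement lands between the two steps of put_racy().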
Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Kill the no longer needed uprobes_srcu/uprobe_srcu_id code.
It doesn't really work anyway. synchronize_srcu() can only
synchronize with the code "inside" the
srcu_read_lock/srcu_read_unlock section, while
uprobe_pre_sstep_notifier() does srcu_read_lock() _after_ we
already hit the breakpoint.
I guess this probably works "in practice". synchronize_srcu() is
slow and it implies synchronize_sched(), and the probed task
enters the non-preemptible section at the start of the exception
handler. Still, this is not right, at least in theory, and
task->uprobe_srcu_id bloats task_struct.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529193008.GG8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently handle_swbp() assumes that it can't race with
unregister, so it roughly does:
if (find_uprobe(vaddr))
process_uprobe();
else
send_sig(SIGTRAP);
This relies on the not-really-working uprobes_srcu code we are
going to remove, see the next patch.
With this patch we rely on the result of
is_swbp_at_addr(bp_vaddr) if find_uprobe() fails.
If is_swbp == 1, then we hit the normal int3, we should send
SIGTRAP.
If is_swbp == 0, we raced with uprobe_unregister(), we simply
restart this insn again.
The "difficult" case is is_swbp == -EFAULT, when we can't read
this memory. In this case I think we should restart too, and
this is more correct compared to the current code which sends
SIGTRAP.
Ignoring ENOMEM/etc from get_user_pages(), this can only happen
if another thread unmaps this memory before find_active_uprobe()
takes mmap_sem. It would be better to pretend it was unmapped
before this insn was executed, restart, and get SIGSEGV.
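For illustration only, the three cases above can be summarised as a
small decision function. This is a sketch: enum swbp_action and
swbp_decide() are hypothetical names, and the inputs stand for the
results of find_uprobe() and is_swbp_at_addr().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome labels, used by this sketch only. */
enum swbp_action { RUN_UPROBE, SEND_SIGTRAP, RESTART_INSN };

/*
 * uprobe_found: whether a uprobe was found at the trapping address.
 * is_swbp: 1 if a breakpoint insn is there, 0 if not, negative errno
 *          if the memory could not be read.
 */
static enum swbp_action swbp_decide(bool uprobe_found, int is_swbp)
{
    if (uprobe_found)
        return RUN_UPROBE;
    if (is_swbp > 0)
        return SEND_SIGTRAP;   /* ordinary int3, not one of ours */
    /* 0: raced with unregister; negative (e.g. -EFAULT): memory is
     * unreadable. In both cases restart the instruction. */
    return RESTART_INSN;
}
```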
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192947.GF8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change register_for_each_vma() to take mm->mmap_sem for writing.
This is a bit unfortunate but hopefully not too bad, this is the
slow path anyway.
This is needed to ensure that find_active_uprobe() can not race
with uprobe_register() which adds the new bp at the same
bp_vaddr, after find_uprobe() fails and before
is_swbp_at_addr_fast() checks the memory.
IOW, this is needed to ensure that if find_active_uprobe()
returns NULL but is_swbp == true, we can safely assume that it
was the "normal" int3 and we should send SIGTRAP.
There is another reason for this change. We are going to replace
uprobes_state->count with MMF_ flags set by register/unregister
and cleared by find_active_uprobe(), and set/clear shouldn't
race with each other.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192928.GE8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A separate patch to simplify the review, and for the
documentation.
The patch adds another "int *is_swbp" argument to
find_active_uprobe(), so far its only caller doesn't use this
info.
With this patch find_active_uprobe() additionally does:
- if find_vma() + ->vm_start check fails, *is_swbp = -EFAULT
- otherwise, if valid_vma() + find_uprobe() fails, it holds
the result of is_swbp_at_addr(), can be negative too. The
latter is only possible if we raced with another thread
which did munmap/etc after we hit this bp.
IOW, if find_active_uprobe(&is_swbp) returns NULL, the caller
can look at is_swbp to figure out whether the current insn is a bp
or not, or detect the race with another thread if it is
negative.
Note: I think that performance-wise this change is fine. This
adds is_swbp_at_addr(), but only if we raced with
uprobe_unregister() or if we hit the "normal" int3 but this mm
has uprobes as well. And even in this case the slow
read_opcode() path is very unlikely, this insn recently
triggered do_int3(), __copy_from_user_inatomic() shouldn't fail
in the likely case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192914.GD8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No functional changes. Move the "find uprobe" code from
handle_swbp() to the new helper, find_active_uprobe().
Note: with or without this change, the find-active-uprobe logic
is not exactly right. We can race with another thread which
unmaps the memory with the valid uprobe before we take
mm->mmap_sem. We can't find this uprobe simply because
find_vma() fails. In this case we wrongly assume that this trap
was not caused by uprobe and send the erroneous SIGTRAP. See the
next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192857.GC8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
set_orig_insn()->read_opcode() should not fail if the probed
task did mprotect() after uprobe_register(), so change it to use
FOLL_FORCE. Without FOLL_WRITE this doesn't have any "side"
effect but allows reading the !VM_READ memory.
There is another reason for this change, we are going to use
is_swbp_at_addr() from handle_swbp() which can race with another
thread doing mprotect().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192759.GB8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change is_swbp_at_addr() to try to avoid the costly
read_opcode() if mm == current->mm, __copy_from_user_inatomic()
should succeed in the likely case.
Currently this optimization is not important, but we are going
to add more is_swbp_at_addr(current->mm) callers.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192744.GA8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It does not get processed because sched_domain_level_max is 0 at the
time that setup_relax_domain_level() is run.
Simply accept the value as it is, as we don't know the value of
sched_domain_level_max until sched domain construction is completed.
Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls
the set_domain_attribute() routine prior to setting the sd->level, however,
the set_domain_attribute() routine relies on the sd->level to decide whether
idle load balancing will be off/on.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add some code to validate assumptions we're making and output
warnings if they do not hold.
If these trigger, we want to know about it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alex Shi <lkml.alex@gmail.com>
Link: http://lkml.kernel.org/n/tip-6uc3wk5s9udxtdl9cnku0vtt@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Often when we run into misshapen topologies the balance iteration
fails to update the cpu power properly and we'll end up in divide-by-zero traps.
Always initialize the cpu-power to a semi-sane value so that we can
at least boot the machine, even if the load-balancer might not
function correctly.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Weird topologies can lead to asymmetric domain setups. This needs
further consideration since these setups are typically non-minimal
too.
For now, make it work by adding an extra mask selecting which CPUs
are allowed to iterate up.
The topology that triggered it is the one from David Rientjes:
10 20 20 30
20 10 20 20
20 20 10 20
30 20 20 10
resulting in boxes that wouldn't even boot.
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Roland Dreier reported spurious, hard to trigger lockdep warnings
within the scheduler - without any real lockup.
This bit gives us the right clue:
> [89945.640512] [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
> [89945.640568] [<ffffffff8104c546>] push_rt_task+0xc6/0x290
If you look at that code you'll find the double_lock_balance() in
question is the one in find_lock_lowest_rq() [yay for inlining].
Now find_lock_lowest_rq() has a bug: it fails to use
double_unlock_balance() in one exit path. If this results in a retry in
push_rt_task() we'll call double_lock_balance() again, at which point
we'll run into said lockdep confusion.
Reported-by: Roland Dreier <roland@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched
domain support") removed the NODE sched domain and started checking
whether the node distance in the SLIT table is farther than REMOTE_DISTANCE;
if so, the load balance chance at exec/fork/wake_affine
points is lost.
But actually, even when the node distance is farther than REMOTE_DISTANCE,
modern CPUs have QPI-like connections, which ensure that memory
access is not too slow between nodes. So the above change in behavior
on NUMA machines causes a performance regression on various benchmarks:
hackbench, tbench, netperf, oltp, etc.
This patch restores the old scheduler behavior on all my
Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the
performance regressions. (All of them have just two distinct distances: 10 and 21.)
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Gilad reported at
http://lkml.kernel.org/r/1336056962-10465-2-git-send-email-gilad@benyossef.com
"Current timer code fails to correctly return a value meaning that
there is no future timer event, with the result that the timer keeps
getting re-armed in HZ one shot mode even when we could turn it off,
generating unneeded interrupts.
What is happening is that when __next_timer_interrupt() wishes
to return a value that signifies "there is no future timer
event", it returns (base->timer_jiffies + NEXT_TIMER_MAX_DELTA).
However, the code in tick_nohz_stop_sched_tick(), which called
__next_timer_interrupt() via get_next_timer_interrupt(),
compares the return value to (last_jiffies + NEXT_TIMER_MAX_DELTA)
to see if the timer needs to be re-armed.
base->timer_jiffies != last_jiffies and so tick_nohz_stop_sched_tick()
interprets the return value as indication that there is a distant
future event 12 days from now and programs the timer to fire next
after KTIME_MAX nsecs instead of avoiding to arm it. This ends up
causing a needless interrupt once every KTIME_MAX nsecs."
Fix this by using the new active timer accounting. This completely avoids
scans when no active timer is enqueued, so we don't have to rely
on base->next_timer and base->timer_jiffies anymore.
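For illustration only, the mismatch Gilad describes can be reproduced
with plain arithmetic. This is a sketch under stated assumptions:
MAX_DELTA is a stand-in for NEXT_TIMER_MAX_DELTA (its value here is
assumed), and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DELTA ((1UL << 30) - 1)  /* stand-in for NEXT_TIMER_MAX_DELTA */

/* What the timer code returned to mean "no future timer event". */
static unsigned long no_timer_sentinel(unsigned long timer_jiffies)
{
    return timer_jiffies + MAX_DELTA;
}

/* How the tick code tested for it: relative to last_jiffies instead. */
static bool caller_sees_no_timer(unsigned long next,
                                 unsigned long last_jiffies)
{
    return next - last_jiffies >= MAX_DELTA;
}
```

The sentinel is only recognised when the two reference points happen to
be equal. If timer_jiffies lags last_jiffies by even a few jiffies, the
caller sees a very distant but real timer and keeps arming the tick.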
Reported-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.317535385@linutronix.de
The code in get_next_timer_interrupt() is suboptimal as it has to run
through the cascade to find the next expiring timer. On a completely
idle core we should only do that when there is an active timer
enqueued and base->next_timer does not give us a fast answer.
Add accounting of the active timers to the now consolidated
attach/detach code. I deliberately avoided sanity checks because the
code is fully symmetric and any fiddling with timers w/o using the API
functions will lead to cute explosions anyway. ulong is big enough
even on 32bit and if we really run into the situation to have more
than 1<<32 timers enqueued there, then we are definitely not in a
state to go idle and run through that code.
This allows us to fix another shortcoming of get_next_timer_interrupt().
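For illustration only, the effect of the active timer count can be shown
with a toy timer base. This is a sketch, not the timer wheel: struct
timer_base_model, NO_TIMER, SLOTS and the function names are all
hypothetical, and a flat array replaces the cascade.

```c
#include <assert.h>
#include <stddef.h>

#define NO_TIMER ((unsigned long)-1)  /* hypothetical "nothing pending" value */
#define SLOTS 8

struct timer_base_model {
    unsigned long expires[SLOTS];  /* 0 marks a free slot in this sketch */
    unsigned long active_timers;   /* maintained by attach/detach only */
    unsigned long scans;           /* how often the slots had to be walked */
};

static void attach(struct timer_base_model *b, size_t slot, unsigned long exp)
{
    assert(slot < SLOTS && exp != 0 && b->expires[slot] == 0);
    b->expires[slot] = exp;
    b->active_timers++;
}

static void detach(struct timer_base_model *b, size_t slot)
{
    assert(slot < SLOTS && b->expires[slot] != 0);
    b->expires[slot] = 0;
    b->active_timers--;
}

static unsigned long next_expiry(struct timer_base_model *b)
{
    unsigned long best = NO_TIMER;

    if (b->active_timers == 0)
        return NO_TIMER;           /* fast path: skip the scan entirely */

    b->scans++;
    for (size_t i = 0; i < SLOTS; i++)
        if (b->expires[i] != 0 && b->expires[i] < best)
            best = b->expires[i];
    return best;
}
```

Since attach() and detach() are the only places that touch the counter,
it stays exact without extra sanity checks, which mirrors the symmetry
argument above.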
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.236377028@linutronix.de
Another bunch of mindlessly copied code. All callers of
internal_add_timer() except the recascading code update
base->next_timer.
Move this into internal_add_timer() and let the cascading code call
__internal_add_timer().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.189946224@linutronix.de
Most callers of detach_timer() have the same pattern around
them: check whether the timer is pending and, if so, update
base->next_timer.
Create detach_if_pending() and replace the duplicated code.
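For illustration only, the consolidated pattern might look like the
following userspace sketch. The struct layouts and the name
detach_if_pending_model() are hypothetical simplifications; the reset of
next_timer to timer_jiffies is an assumption about what "updating" means
here.

```c
#include <assert.h>
#include <stdbool.h>

struct timer_model {
    bool pending;
    unsigned long expires;
};

struct base_model {
    unsigned long next_timer;     /* cached earliest expiry */
    unsigned long timer_jiffies;  /* how far the base has been processed */
};

/* Returns true if the timer was pending and has now been detached. */
static bool detach_if_pending_model(struct timer_model *t,
                                    struct base_model *b)
{
    if (!t->pending)
        return false;

    t->pending = false;
    /* If the cache pointed at this timer, fall back to a safe value. */
    if (t->expires == b->next_timer)
        b->next_timer = b->timer_jiffies;
    return true;
}
```

Callers then reduce to a single call and a check of the return value
instead of repeating the pending test and the cache update.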
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.131246037@linutronix.de
Pull perf fixes from Arnaldo Carvalho de Melo:
* Endianness fixes from Jiri Olsa
* Fixes for make perf tarball
* Fix for DSO name in perf script callchains, from David Ahern
* Segfault fixes for perf top --callchain, from Namhyung Kim
* Minor function result fixes from Srikar Dronamraju
* Add missing 3rd ioctl parameter, from Namhyung Kim
* Fix pager usage in minimal embedded systems, from Avik Sil
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull cgroup fix from Tejun Heo:
"This fixes the possible premature superblock release on umount bug
mentioned during v3.5-rc1 pull request.
Originally, cgroup dentry destruction path assumed that cgroup dentry
didn't have any reference left after cgroup removal thus put super
during dentry removal. Now that there can be lingering dentry
references, this led to super being put with live dentries. This
patch fixes the problem by putting super ref on dentry release instead
of removal."
* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: superblock can't be released with active dentries
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Remove NULL assignment of dattr_cur
sched: Remove the last NULL entry from sched_feat_names
sched: Make sched_feat_names const
sched/rt: Fix SCHED_RR across cgroups
sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
sched: Make sure to not re-read variables after validation
sched: Fix SD_OVERLAP
sched: Don't try allocating memory from offline nodes
sched/nohz: Fix rq->cpu_load calculations some more
sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask
Commit 6b43ae8a61 (ntp: Fix leap-second hrtimer livelock) broke the
leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to
wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC.
Adjust wall_to_monotonic when NTP inserted a leapsecond.
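The arithmetic behind the fix can be illustrated with a toy model. It assumes CLOCK_MONOTONIC is computed as wall time plus wall_to_monotonic and that an inserted leap second is applied as a step of -1 to the wall clock; the struct and function names are hypothetical and only whole seconds are modelled.

```c
#include <assert.h>

/* Toy timekeeper, seconds only; not the kernel's struct timekeeper. */
struct timekeeper_model {
	long xtime_sec;			/* CLOCK_REALTIME */
	long wall_to_monotonic_sec;	/* offset added to get CLOCK_MONOTONIC */
};

static long monotonic_sec(const struct timekeeper_model *tk)
{
	return tk->xtime_sec + tk->wall_to_monotonic_sec;
}

/* leap is the step applied to wall time: -1 when a second is inserted. */
static void apply_leap(struct timekeeper_model *tk, int leap)
{
	assert(leap == -1 || leap == 1);
	tk->xtime_sec += leap;
	/* Compensate, otherwise the monotonic clock would jump as well. */
	tk->wall_to_monotonic_sec -= leap;
}
```

Without the second assignment in apply_leap(), monotonic_sec() would step back by one second, which is the discontinuity described above.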
Reported-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Richard Cochran <richardcochran@gmail.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull irq and smpboot updates from Thomas Gleixner:
"Just cleanup patches with no functional change and a fix for suspend
issues."
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Introduce irq_do_set_affinity() to reduce duplicated code
genirq: Add IRQS_PENDING for nested and simple irq
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smpboot, idle: Fix comment mismatch over idle_threads_init()
smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()
Pull timer updates from Thomas Gleixner:
"The clocksource driver is pure hardware enablement and the skew option
is default off, well tested and non dangerous."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Move skew_tick option into the HIGH_RES_TIMER section
clocksource: em_sti: Add DT support
clocksource: em_sti: Emma Mobile STI driver
clockevents: Make clockevents_config() a global symbol
tick: Add tick skew boot option
Pull third pile of signal handling patches from Al Viro:
"This time it's mostly helpers and conversions to them; there's a lot
of stuff remaining in the tree, but that'll either go in -rc2
(isolated bug fixes, ideally via arch maintainers' trees) or will sit
there until the next cycle."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
x86: get rid of calling do_notify_resume() when returning to kernel mode
blackfin: check __get_user() return value
whack-a-mole with TIF_FREEZE
FRV: Optimise the system call exit path in entry.S [ver #2]
FRV: Shrink TIF_WORK_MASK [ver #2]
FRV: Prevent syscall exit tracing and notify_resume at end of kernel exceptions
new helper: signal_delivered()
powerpc: get rid of restore_sigmask()
most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set
set_restore_sigmask() is never called without SIGPENDING (and never should be)
TIF_RESTORE_SIGMASK can be set only when TIF_SIGPENDING is set
don't call try_to_freeze() from do_signal()
pull clearing RESTORE_SIGMASK into block_sigmask()
sh64: failure to build sigframe != signal without handler
openrisc: tracehook_signal_handler() is supposed to be called on success
new helper: sigmask_to_save()
new helper: restore_saved_sigmask()
new helpers: {clear,test,test_and_clear}_restore_sigmask()
HAVE_RESTORE_SIGMASK is defined on all architectures now
Pull vfs changes from Al Viro.
"A lot of misc stuff. The obvious groups:
* Miklos' atomic_open series; kills the damn abuse of
->d_revalidate() by NFS, which was the major stumbling block for
all work in that area.
* ripping security_file_mmap() and dealing with deadlocks in the
area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
general.
* ->encode_fh() switched to saner API; insane fake dentry in
mm/cleancache.c gone.
* assorted annotations in fs (endianness, __user)
* parts of Artem's ->s_dirty work (jffs2 and reiserfs parts)
* ->update_time() work from Josef.
* other bits and pieces all over the place.
Normally it would've been in two or three pull requests, but
signal.git stuff had eaten a lot of time during this cycle ;-/"
Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
nfs: don't open in ->d_revalidate
vfs: retry last component if opening stale dentry
vfs: nameidata_to_filp(): don't throw away file on error
vfs: nameidata_to_filp(): inline __dentry_open()
vfs: do_dentry_open(): don't put filp
vfs: split __dentry_open()
vfs: do_last() common post lookup
vfs: do_last(): add audit_inode before open
vfs: do_last(): only return EISDIR for O_CREAT
vfs: do_last(): check LOOKUP_DIRECTORY
vfs: do_last(): make ENOENT exit RCU safe
vfs: make follow_link check RCU safe
vfs: do_last(): use inode variable
vfs: do_last(): inline walk_component()
vfs: do_last(): make exit RCU safe
vfs: split do_lookup()
Btrfs: move over to use ->update_time
fs: introduce inode operation ->update_time
reiserfs: get rid of resierfs_sync_super
reiserfs: mark the superblock as dirty a bit later
...
Does block_sigmask() + tracehook_signal_handler(); called when
sigframe has been successfully built. All architectures converted
to it; block_sigmask() itself is gone now (merged into this one).
I'm still not too happy with the signature, but that's a separate
story (IMO we need a structure that would contain signal number +
siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
signal_delivered(), handle_signal() and probably setup...frame() -
take one).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(),
added set_current_blocked() that will exclude unblockable signals, switched
open-coded instances to it.
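The wrapper pattern can be sketched in userspace with POSIX signal sets. This is only an analogue: the global current_blocked_model and both function names are hypothetical, and the real kernel helpers operate on the current task.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>

/* Stand-in for current->blocked. */
static sigset_t current_blocked_model;

/* Raw variant: installs the mask exactly as given. */
static void set_blocked_raw(const sigset_t *newset)
{
	assert(newset);
	current_blocked_model = *newset;
}

/* Default variant: SIGKILL and SIGSTOP can never end up blocked. */
static void set_blocked_model(sigset_t *newset)
{
	sigdelset(newset, SIGKILL);
	sigdelset(newset, SIGSTOP);
	set_blocked_raw(newset);
}
```

Putting the filtering into the default entry point means a caller has to opt in explicitly to get the unfiltered behaviour.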
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK
and picks default set_restore_sigmask() in linux/thread_info.h. Kill the
ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones
get merged.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull second pile of signal handling patches from Al Viro:
"This one is just task_work_add() series + remaining prereqs for it.
There probably will be another pull request from that tree this
cycle - at least for helpers, to get them out of the way for per-arch
fixes remaining in the tree."
Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
had brought in commit 97fd75b7b8 ("kernel/irq/manage.c: use the
pr_foo() infrastructure to prefix printks") which changed one of the
pr_err() calls that this merge moves around.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
keys: kill task_struct->replacement_session_keyring
keys: kill the dummy key_replace_session_keyring()
keys: change keyctl_session_to_parent() to use task_work_add()
genirq: reimplement exit_irq_thread() hook via task_work_add()
task_work_add: generic process-context callbacks
avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
parisc: need to check NOTIFY_RESUME when exiting from syscall
move key_repace_session_keyring() into tracehook_notify_resume()
TIF_NOTIFY_RESUME is defined on all targets now
Merge misc patches from Andrew Morton:
- the "misc" tree - stuff from all over the map
- checkpatch updates
- fatfs
- kmod changes
- procfs
- cpumask
- UML
- kexec
- mqueue
- rapidio
- pidns
- some checkpoint-restore feature work. Reluctantly. Most of it
delayed a release. I'm still rather worried that we don't have a
clear roadmap to completion for this work.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (78 patches)
kconfig: update compression algorithm info
c/r: prctl: add ability to set new mm_struct::exe_file
c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
syscalls, x86: add __NR_kcmp syscall
fs, proc: introduce /proc/<pid>/task/<tid>/children entry
sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
eventfd: change int to __u64 in eventfd_signal()
fs/nls: add Apple NLS
pidns: make killed children autoreap
pidns: use task_active_pid_ns in do_notify_parent
rapidio/tsi721: add DMA engine support
rapidio: add DMA engine support for RIO data transfers
ipc/mqueue: add rbtree node caching support
tools/selftests: add mq_perf_tests
ipc/mqueue: strengthen checks on mqueue creation
ipc/mqueue: correct mq_attr_ok test
ipc/mqueue: improve performance of send/recv
selftests: add mq_open_tests
...
When we do restore we would like to have a way to setup a former
mm_struct::exe_file so that /proc/pid/exe would point to the original
executable file a process had at checkpoint time.
For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
file descriptor which will be set as a source for new /proc/$pid/exe
symlink.
Note that it allows changing /proc/$pid/exe if there are no VM_EXECUTABLE
vmas present for the current process, simply because this feature is
specific to C/R and mm::num_exe_file_vmas becomes meaningless after that.
To minimize the number of transitions the /proc/pid/exe symlink might have,
this feature is implemented in a one-shot manner. Thus, once changed, the
symlink can't be changed again. This should help sysadmins monitor the
symlinks of all processes running in a system.
In particular one could make a snapshot of processes and raise an alarm if
there are unexpected changes of /proc/pid/exe's in a system.
Note -- this feature is available iff CONFIG_CHECKPOINT_RESTORE is set and
the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
request to change symlink will be rejected.
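From user space, a call might look like the sketch below. The helper name set_exe_link() is hypothetical; the fallback values for PR_SET_MM and PR_SET_MM_EXE_FILE are the ones from linux/prctl.h and only matter with older headers. The call is expected to fail unless the conditions described above are met.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_MM
#define PR_SET_MM 35
#endif
#ifndef PR_SET_MM_EXE_FILE
#define PR_SET_MM_EXE_FILE 13
#endif

/*
 * Ask the kernel to point /proc/self/exe at 'path'.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_exe_link(const char *path)
{
	int fd, ret, saved;

	assert(path);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0UL, 0UL);
	saved = errno;
	close(fd);		/* the descriptor is only needed for the call */
	errno = saved;
	return ret;
}
```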
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During checkpoint we dump whole process memory to a file and the dump
includes process stack memory. But among stack data itself, the stack
carries additional parameters such as command line arguments, environment
data and auxiliary vector.
So when we do restore procedure and once we've restored stack data itself
we need to setup mm_struct::arg_start/end, env_start/end, so restored
process would be able to find command line arguments and environment data
it had at checkpoint time. The same applies to auxiliary vector.
For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
ENV_END | AUXV) codes are introduced.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While doing checkpoint-restore in user space one needs to determine
whether various kernel objects (like mm_struct-s or file_struct-s) are
shared between tasks and restore this state.
The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there is currently no way of solving the 1st one.
One of the ways of checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task in its proc file, but showing such
info is considered to be not that good for security reasons.
Thus after some debate we came to the conclusion that a so-called
'comparison' syscall might be the best candidate. So here it is --
__NR_kcmp.
It takes up to 5 arguments - the pids of the two tasks (whose
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.
Lookups for pids are done in the caller's PID namespace only.
At the moment only x86 is supported and tested.
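A hedged userspace sketch of the file comparison case follows. The wrapper name same_open_file() is hypothetical; the value 0 for the file comparison type is the one from linux/kcmp.h, and on systems whose headers lack SYS_kcmp the wrapper simply reports ENOSYS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* KCMP_FILE from linux/kcmp.h, redefined locally under another name. */
enum { KCMP_FILE_TYPE = 0 };

/*
 * Compare the open file behind fd1 in pid1 with the one behind fd2 in pid2.
 * Returns 0 if both descriptors refer to the same open file, a positive
 * value if they differ, and -1 with errno set on error.
 */
static long same_open_file(pid_t pid1, int fd1, pid_t pid2, int fd2)
{
	assert(fd1 >= 0 && fd2 >= 0);
#ifdef SYS_kcmp
	return syscall(SYS_kcmp, pid1, pid2, KCMP_FILE_TYPE,
		       (unsigned long)fd1, (unsigned long)fd2);
#else
	errno = ENOSYS;
	return -1;
#endif
}
```

A descriptor and its dup() should compare equal, while the two ends of a pipe should not.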
[akpm@linux-foundation.org: fix up selftests, warnings]
[akpm@linux-foundation.org: include errno.h]
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: Michal Marek <mmarek@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For those who don't need C/R functionality there is no need to control
the last pid, i.e. the pid for the next fork() call.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Force SIGCHLD handling to SIG_IGN so that signals are not generated and so
that the children autoreap. This increases the parallelism and in general
the speed of network namespace shutdown.
Note self-reaping children can exist past zap_pid_ns_processes() but they
will all be reaped before we allow the pid namespace init task with pid ==
1 to be reaped.
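The autoreap semantics relied on here can be observed from user space. The sketch below is only an illustration of the POSIX behaviour (the function name is hypothetical, and it does not involve pid namespaces): with SIGCHLD ignored, an exited child leaves no zombie and waitpid() ends with ECHILD.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if an exited child was reaped automatically,
 * 0 if it still had to be collected, -1 on setup errors.
 */
static int child_autoreaps(void)
{
	struct sigaction ign, old;
	pid_t pid;
	int ret;

	ign.sa_handler = SIG_IGN;
	ign.sa_flags = 0;
	sigemptyset(&ign.sa_mask);
	if (sigaction(SIGCHLD, &ign, &old) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		sigaction(SIGCHLD, &old, NULL);
		return -1;
	}
	if (pid == 0)
		_exit(0);

	/* Blocks until the child is gone, then fails: nothing to collect. */
	ret = (waitpid(pid, NULL, 0) == -1 && errno == ECHILD) ? 1 : 0;

	sigaction(SIGCHLD, &old, NULL);
	assert(ret == 0 || ret == 1);
	return ret;
}
```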
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using task_active_pid_ns is more robust because it works even after we
have called exit_namespaces. This change allows us to have parent
processes that are zombies. Normally a zombie parent process is crazy
and the last thing you would want to have, but in the case of not letting
the init process of a pid namespace be reaped until all of its children
are dead and reaped, a zombie parent process is exactly what we want.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add more comments on clear_tasks_mm_cpumask, plus add a runtime check:
the function is only suitable for offlined CPUs, and if called
inappropriately, the kernel should scream aloud.
[akpm@linux-foundation.org: tweak comment: s/walks up/walks/, use 80 cols]
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many architectures clear tasks' mm_cpumask like this:
    read_lock(&tasklist_lock);
    for_each_process(p) {
        if (p->mm)
            cpumask_clear_cpu(cpu, mm_cpumask(p->mm));
    }
    read_unlock(&tasklist_lock);
Depending on the context, the code above may have several problems,
such as:
1. Working with task->mm w/o getting mm or grabbing the task lock is
dangerous as ->mm might disappear (exit_mm() assigns NULL under
task_lock(), so tasklist lock is not enough).
2. Checking for process->mm is not enough because process' main
thread may exit or detach its mm via use_mm(), but other threads
may still have a valid mm.
This patch implements a small helper function that does things
correctly, i.e.:
1. We take the task's lock while we handle its mm (we can't use
get_task_mm()/mmput() pair as mmput() might sleep);
2. To catch exited main thread case, we use find_lock_task_mm(),
which walks up all threads and returns an appropriate task
(with task lock held).
Also, per Peter Zijlstra's idea, now we don't grab tasklist_lock in
the new helper, instead we take the rcu read lock. We can do this
because the function is called after the cpu is taken down and marked
offline, so no new tasks will get this cpu set in their mm mask.
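The exited-main-thread case can be shown with a small userspace toy model. Everything here is a stand-in: the *_model types are hypothetical, the cpumask is a plain bit field, and the task lock and RCU read lock that the real helper takes are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 4

struct mm_model {
	unsigned long cpumask;		/* one bit per CPU */
};

struct thread_model {
	struct mm_model *mm;		/* NULL once the thread gave up its mm */
};

struct process_model {
	struct thread_model threads[MAX_THREADS];
	int nr_threads;			/* threads[0] is the main thread */
};

/* Plays the role of find_lock_task_mm(), minus the locking. */
static struct thread_model *find_thread_with_mm(struct process_model *p)
{
	for (int i = 0; i < p->nr_threads; i++)
		if (p->threads[i].mm)
			return &p->threads[i];
	return NULL;
}

/* Clear 'cpu' from the mm mask of every process that still has an mm. */
static void clear_cpu_in_masks(struct process_model *procs, int n, int cpu)
{
	assert(cpu >= 0 && cpu < (int)(8 * sizeof(unsigned long)));
	for (int i = 0; i < n; i++) {
		struct thread_model *t = find_thread_with_mm(&procs[i]);

		if (!t)
			continue;	/* no thread has an mm left */
		t->mm->cpumask &= ~(1UL << cpu);
	}
}
```

A check of only threads[0].mm would skip a process whose main thread has exited, which is problem 2 from the list above.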
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Child should wake up the parent from vfork() only after finishing all
operations with shared mm. There is no sense in using
CLONE_CHILD_CLEARTID together with CLONE_VFORK, but it looks more accurate
now.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In embedded systems, sometimes the same program (busybox) is the cause of
multiple warnings. Outputting the pid with the program name in the
warning printk helps distinguish which instances of a program are using
the stack most.
This is a small patch, but useful.
Signed-off-by: Tim Bird <tim.bird@am.sony.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8f92054e7c ("CRED: Fix __task_cred()'s lockdep check and banner
comment"):
add the following validation condition:
task->exit_state >= 0
to permit the access if the target task is dead and therefore
unable to change its own credentials.
OK, but afaics currently this can only help wait_task_zombie() which calls
__task_cred() without rcu lock.
Remove this validation and change wait_task_zombie() to use task_uid()
instead. This means we do rcu_read_lock() only to shut up the lockdep,
but we already do the same in, say, wait_task_stopped().
task_is_dead() should die, task->exit_state != 0 means that this task has
passed exit_notify(), only do_wait-like code paths should use this.
Unfortunately, we can't kill task_is_dead() right now, it has already
acquired buggy users in drivers/staging. The fix already exists.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Warning(kernel/kmod.c:419): No description found for parameter 'depth'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we move call_usermodehelper_fns() to kmod.c file and EXPORT_SYMBOL it
we can avoid exporting all its helper functions:
call_usermodehelper_setup
call_usermodehelper_setfns
call_usermodehelper_exec
And make all of them static to kmod.c
Since the optimizer will see all these as a single call site it will
inline them inside call_usermodehelper_fns(). So we lose the call to
_fns but gain 3 calls to the helpers. (Not that it matters)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both kernel/sys.c && security/keys/request_key.c were inlining the exact
same code as call_usermodehelper_fns(). So simply convert these sites to
directly use call_usermodehelper_fns().
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
call_usermodehelper_freeinfo() is not used outside of kmod.c. So unexport
it, and make it static to kmod.c
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the module-wide pr_fmt() mechanism rather than open-coding "genirq: "
everywhere.
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sethostname() and setdomainname() notify userspace on failure (without
modifying uts_kern_table). Change things so that we only notify userspace
on success, when uts_kern_table was actually modified.
Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: WANG Cong <amwang@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the comment of allocate_resource(), the explanation of the parameters
max and min is not correct.
Actually, these two parameters are used to specify the range of the
resource that will be allocated, not the min/max size that will be
allocated.
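To make the distinction concrete, here is a toy allocator with the same meaning of min and max. It is a hypothetical helper, not allocate_resource() itself: it ignores existing resources and simply places a block of the requested size at the lowest aligned address inside [min, max].

```c
#include <assert.h>

/*
 * Place 'size' bytes inside the address range [min, max] (inclusive).
 * min and max bound where the block may live; they say nothing about
 * how large it is. Returns 0 and stores the start, or -1 if it cannot fit.
 */
static int alloc_in_range(unsigned long size, unsigned long min,
			  unsigned long max, unsigned long align,
			  unsigned long *start)
{
	unsigned long s;

	assert(align != 0 && start);
	s = (min + align - 1) / align * align;	/* round min up to align */

	/* s < min catches wrap-around of the rounding above. */
	if (size == 0 || s < min || s > max || max - s < size - 1)
		return -1;

	*start = s;
	return 0;
}
```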
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The @func callback was invoked twice for the group leader when
perf_event_for_each() was called. It seems commit 75f937f24b
("perf_counter: Fix ctx->mutex vs counter ->mutex inversion") introduced
the mistake.
Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338443506-25009-1-git-send-email-namhyung.kim@lge.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf updates from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
perf ui browser: Stop using 'self'
perf annotate browser: Read perf config file for settings
perf config: Allow '_' in config file variable names
perf annotate browser: Make feature toggles global
perf annotate browser: The idx_asm field should be used in asm only view
perf tools: Convert critical messages to ui__error()
perf ui: Make --stdio default when TUI is not supported
tools lib traceevent: Silence compiler warning on 32bit build
perf record: Fix branch_stack type in perf_record_opts
perf tools: Reconstruct event with modifiers from perf_event_attr
perf top: Fix counter name fixup when fallbacking to cpu-clock
perf tools: fix thread_map__new_by_pid_str() memory leak in error path
perf tools: Do not use _FORTIFY_SOURCE when DEBUG=1 is specified
tools lib traceevent: Fix signature of create_arg_item()
tools lib traceevent: Use proper function parameter type
tools lib traceevent: Fix freeing arg on process_dynamic_array()
tools lib traceevent: Fix a possibly wrong memory dereference
tools lib traceevent: Fix a possible memory leak
tools lib traceevent: Allow expressions in __print_symbolic() fields
perf evlist: Explicititely initialize input_name
...
Merge block/IO core bits from Jens Axboe:
"This is a bit bigger on the core side than usual, but that is purely
because we decided to hold off on parts of Tejun's submission on 3.4
to give it a bit more time to simmer. As a consequence, it's seen a
long cycle in for-next.
It contains:
- Bug fix from Dan, wrong locking type.
- Relax splice gifting restriction from Eric.
- A ton of updates from Tejun, primarily for blkcg. This improves
the code a lot, making the API nicer and cleaner, and also includes
fixes for how we handle and tie policies and re-activate on
switches. The changes also include generic bug fixes.
- A simple fix from Vivek, along with a fix for doing proper delayed
allocation of the blkcg stats."
Fix up annoying conflict just due to different merge resolution in
Documentation/feature-removal-schedule.txt
* 'for-3.5/core' of git://git.kernel.dk/linux-block: (92 commits)
blkcg: tg_stats_alloc_lock is an irq lock
vmsplice: relax alignement requirements for SPLICE_F_GIFT
blkcg: use radix tree to index blkgs from blkcg
blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
block: fix elvpriv allocation failure handling
block: collapse blk_alloc_request() into get_request()
blkcg: collapse blkcg_policy_ops into blkcg_policy
blkcg: embed struct blkg_policy_data in policy specific data
blkcg: mass rename of blkcg API
blkcg: style cleanups for blk-cgroup.h
blkcg: remove blkio_group->path[]
blkcg: blkg_rwstat_read() was missing inline
blkcg: shoot down blkgs if all policies are deactivated
blkcg: drop stuff unused after per-queue policy activation update
blkcg: implement per-queue policy activation
blkcg: add request_queue->root_blkg
blkcg: make request_queue bypassing on allocation
blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
blkcg: make blkg_conf_prep() take @pol and return with queue lock held
blkcg: remove static policy ID enums
...
No need to have the last NULL entry.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FBF29E7.5020805@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The strings sched_feat_names are never changed.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FBF29B2.9030904@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
if they were the only element on their rq. However, with cgroups
a SCHED_RR task could be the only element on its per-cgroup rq but
still be competing with other SCHED_RR tasks in its parent's
cgroup. In this case, the SCHED_RR task in the child cgroup would
never yield at the end of its timeslice. If the child cgroup
rt_runtime_us was the same as the parent cgroup rt_runtime_us,
the task in the parent cgroup would starve completely.
Modify task_tick_rt() to check that the task is the only task on its
rq, and that each of the scheduling entities of its ancestors
is also the only entity on its rq.
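The hierarchy walk can be sketched with a toy model. The types and the nr_entities counter are assumptions made for illustration; the real code inspects the run lists of the rt scheduling entities.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rt_rq_model {
	int nr_entities;		/* entities queued on this rq */
};

struct rt_entity_model {
	struct rt_rq_model *rq;		/* rq this entity is queued on */
	struct rt_entity_model *parent;	/* group entity one level up, or NULL */
};

/*
 * A SCHED_RR task whose timeslice expired has to yield if it, or any of
 * its ancestor group entities, shares its rq with another entity.
 */
static bool rr_needs_requeue(const struct rt_entity_model *se)
{
	for (; se; se = se->parent) {
		assert(se->rq);
		if (se->rq->nr_entities > 1)
			return true;
	}
	return false;
}
```

Checking only the first level corresponds to the old optimization, which misses a competitor queued next to the group entity in the parent cgroup.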
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since nr_cpus_allowed is used outside of sched/rt.c and wants to be
used outside of there more, move it to a more natural site.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.
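The general shape of such a fix is to take one snapshot of the shared value and to validate and use that same snapshot. A minimal userspace sketch, assuming a hypothetical available_time() helper and using a volatile read in place of the kernel's own accessors:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Time left after subtracting the RT average from 'total'.
 * *rt_avg may be changed concurrently by someone else, so it is read
 * exactly once; a second read after the comparison could yield a larger
 * value and make the unsigned subtraction wrap around.
 */
static uint64_t available_time(uint64_t total, const volatile uint64_t *rt_avg)
{
	uint64_t avg;

	assert(rt_avg != NULL);
	avg = *rt_avg;			/* single snapshot */

	if (total < avg)
		return 0;
	return total - avg;		/* uses the validated snapshot */
}
```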
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SD_OVERLAP exists to allow overlapping groups; overlapping groups
appear in NUMA topologies that aren't fully connected.
The typical result of not fully connected NUMA is that each cpu (or
rather node) will have different spans for a particular distance.
However due to how sched domains are traversed -- only the first cpu
in the mask goes one level up -- the next level only cares about the
spans of the cpus that went up.
Due to this two things were observed to be broken:
- build_overlap_sched_groups() -- since it's possible the cpu we're
building the groups for exists in multiple (or all) groups, the
selection criteria of the first group didn't ensure there was a cpu
for which it was true that cpumask_first(span) == cpu. Thus load-
balancing would terminate.
- update_group_power() -- assumed that the cpu span of the first
group of the domain was covered by all groups of the child domain.
The above explains why this isn't true, so deal with it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allocators don't appreciate it when you try and allocate memory from
offline nodes.
Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
Reported-and-tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-epfc1io9whb7o22bcujf31vn@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.
Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.
So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge back Linus's latest branch so that we pick up the uprobes changes.
( I tested this branch locally and while it's one from the middle of the
merge window it's a good one to base further work off. )
Signed-off-by: Ingo Molnar <mingo@kernel.org>
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
Since there are at least two users it makes sense to share this code in a
library. This is also easier to maintain than a macro forest.
This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)
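As a rough illustration of "common utility functions plus a common data structure", here is a user-space sketch. It assumes C11 atomic flags as stand-ins for per-CPU spinlocks; `struct lglock_model`, `NR_SLOTS` and all `lg_model_*` names are hypothetical and are not the kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

/* All lock state lives in one ordinary structure, not in macro-generated symbols. */
struct lglock_model {
	atomic_flag slot[NR_SLOTS];
};

static void lg_model_init(struct lglock_model *lg)
{
	for (int i = 0; i < NR_SLOTS; i++)
		atomic_flag_clear(&lg->slot[i]);
}

/* Try to take one slot; returns true on success. */
static bool lg_model_local_trylock(struct lglock_model *lg, int slot)
{
	return !atomic_flag_test_and_set_explicit(&lg->slot[slot],
						   memory_order_acquire);
}

/* "Local" lock: spin on the caller's own slot only. */
static void lg_model_local_lock(struct lglock_model *lg, int slot)
{
	while (!lg_model_local_trylock(lg, slot))
		;	/* spin */
}

static void lg_model_local_unlock(struct lglock_model *lg, int slot)
{
	atomic_flag_clear_explicit(&lg->slot[slot], memory_order_release);
}

/* "Global" lock: take every slot, always in the same order. */
static void lg_model_global_lock(struct lglock_model *lg)
{
	for (int i = 0; i < NR_SLOTS; i++)
		lg_model_local_lock(lg, i);
}

static void lg_model_global_unlock(struct lglock_model *lg)
{
	for (int i = 0; i < NR_SLOTS; i++)
		lg_model_local_unlock(lg, i);
}
```

Because the state is a plain structure, a second user only needs another `struct lglock_model` instance rather than another macro expansion.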
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Using %ps in a printk format will sometimes fail silently and print the
empty string if the address passed in does not match a symbol that
kallsyms knows about. But using %pS will fall back to printing the full
address if kallsyms can't find the symbol. Make %ps act the same as %pS
by falling back to printing the address.
While we're here also make %ps print the module that a symbol comes from
so that it matches what %pS already does. Take this simple function for
example (in a module):
static void test_printk(void)
{
	int test;
	pr_info("with pS: %pS\n", &test);
	pr_info("with ps: %ps\n", &test);
}
Before this patch:
with pS: 0xdff7df44
with ps:
After this patch:
with pS: 0xdff7df44
with ps: 0xdff7df44
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When killing a res_counter which is a child of another counter, we need
to do
	res_counter_uncharge(child, xxx)
	res_counter_charge(parent, xxx)
This is not atomic and wastes CPU. This patch adds
res_counter_uncharge_until(). This function's uncharge propagates to
ancestors until the specified res_counter is reached.
	res_counter_uncharge_until(child, parent, xxx)
Now the operation is atomic and efficient.
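The propagation rule can be sketched in user space as follows. This is a simplified model without any locking; `struct counter_model` and the `model_*` helpers are hypothetical names, and the assumption that the walk stops before touching the specified ancestor is inferred from the uncharge/charge pair it replaces.

```c
#include <stddef.h>

/* Hypothetical, lock-free model of a hierarchical resource counter. */
struct counter_model {
	unsigned long usage;
	struct counter_model *parent;
};

/* Charge val to c and to every ancestor up to the root. */
static void model_charge(struct counter_model *c, unsigned long val)
{
	for (; c; c = c->parent)
		c->usage += val;
}

/*
 * Uncharge val from c and its ancestors, stopping before 'top'.
 * 'top' itself keeps the charge, so no separate re-charge is needed.
 * Passing top == NULL uncharges all the way to the root.
 */
static void model_uncharge_until(struct counter_model *c,
				 struct counter_model *top,
				 unsigned long val)
{
	for (; c && c != top; c = c->parent)
		c->usage -= val;
}
```

With a root/parent/child chain, calling `model_uncharge_until(&child, &parent, n)` leaves the usage of `parent` and `root` untouched, which is the net effect of the original uncharge-then-charge sequence done in a single walk.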
Signed-off-by: Frederic Weisbecker <fweisbec@redhat.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Library functions should not grab locks when the callsites can do it,
even if the lock nests like the rcu read-side lock does.
Push the rcu_read_lock() from css_is_ancestor() to its single user,
mem_cgroup_same_or_subtree() in preparation for another user that may
already hold the rcu read-side lock.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vma length in dup_mmap is calculated and stored in an unsigned int,
which is insufficient and hence overflows for very large maps (beyond
16TB). The following program demonstrates this:
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define GIG (1024 * 1024 * 1024L)
#define EXTENT 16393

int main(void)
{
	int i, r;
	void *m;

	for (i = 0; i < EXTENT; i++) {
		/* Map another 1GB of anonymous memory. */
		m = mmap(NULL, (size_t) GIG, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (m == MAP_FAILED)
			perror("MMAP Failed");
		else
			printf("%d : MMAP returned %p\n", i, m);

		/* Fork so that the kernel has to duplicate the mappings. */
		r = fork();
		if (r == 0) {
			printf("%d: succeeded\n", i);
			return 0;
		} else if (r < 0)
			perror("FORK Failed");
		else
			wait(NULL);
	}
	return 0;
}
Increase the storage size of the result to unsigned long, which is
sufficient for storing the difference between addresses.
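The arithmetic behind the overflow can be sketched as below. It assumes the length is counted in 4 KiB pages, which is what the 16TB threshold suggests; `MODEL_PAGE_SHIFT` and both helper names are hypothetical.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Page count squeezed into 32 bits, as with an unsigned int result. */
static uint32_t vma_pages_u32(uint64_t start, uint64_t end)
{
	return (uint32_t)((end - start) >> MODEL_PAGE_SHIFT);
}

/* Page count kept at full width, as with an unsigned long on 64-bit. */
static uint64_t vma_pages_u64(uint64_t start, uint64_t end)
{
	return (end - start) >> MODEL_PAGE_SHIFT;
}
```

Under that assumption a 16 TiB mapping is 2^44 bytes, or 2^32 pages, which wraps to 0 in a 32-bit result while the 64-bit result keeps the full count.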
Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The swap token code no longer fits in with the current VM model. It
does not play well with cgroups or the better NUMA placement code in
development, since we have only one swap token globally.
It also has the potential to mess with scalability of the system, by
increasing the number of non-reclaimable pages on the active and
inactive anon LRU lists.
Last but not least, the swap token code has been broken for a year
without complaints, as reported by Konstantin Khlebnikov. This suggests
we no longer have much use for it.
The days of sub-1G memory systems with heavy use of swap are over. If
we ever need thrash-reducing code in the future, we will have to
implement something that does scale.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Bob Picco <bpicco@meloft.net>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.
cgroup_create() does grab an active reference on the superblock to
prevent it from going away while there are !root cgroups; however, the
reference is put from cgroup_diput() which is invoked on cgroup
removal, so cgroup dentries which are removed but persisting due to
lingering csses already have released their superblock active refs
allowing superblock to be killed while those dentries are around.
Given the right condition, this makes cgroup_kill_sb() call
kill_litter_super() with dentries with non-zero d_count leading to
BUG() in shrink_dcache_for_umount_subtree().
Fix it by adding cgroup_dops->d_release() operation and moving
deactivate_super() to it. cgroup_diput() now marks dentry->d_fsdata
with itself if superblock should be deactivated and cgroup_d_release()
deactivates the superblock on dentry release.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
LKML-Reference: <CA+1xoqe5hMuxzCRhMy7J0XchDk2ZnuxOHJKikROk1-ReAzcT6g@mail.gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Let the user decide whether power consumption or jitter is the
more important consideration for their machines.
Quoting removal commit af5ab277de:
"Historically, Linux has tried to make the regular timer tick on the
various CPUs not happen at the same time, to avoid contention on
xtime_lock.
Nowadays, with the tickless kernel, this contention no longer happens
since time keeping and updating are done differently. In addition,
this skew is actually hurting power consumption in a measurable way on
many-core systems."
Problems:
- Contrary to the above, systems do encounter contention on both
xtime_lock and RCU structure locks when the tick is synchronized.
- Moderate sized RT systems suffer intolerable jitter due to the tick
being synchronized.
- SGI reports the same for their large systems.
- Fully utilized systems reap no power saving benefit from skew removal,
but do suffer from resulting induced lock contention.
- 0209f649 rcu: limit rcu_node leaf-level fanout
This patch was born to combat lock contention which testing showed
to have been _induced by_ skew removal. Skew the tick, contention
disappeared virtually completely.
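To make "skewing the tick" concrete, here is a small sketch of one way per-CPU offsets could be derived. The formula (spreading CPUs evenly across half a tick period) and the name `tick_skew_ns` are assumptions for illustration, not a statement of what this patch implements.

```c
#include <stdint.h>

/*
 * Offset, in nanoseconds, by which a given CPU's tick is delayed so that
 * CPUs do not all fire at the same instant. Assumed scheme: divide half
 * of the tick period evenly among the CPUs.
 */
static uint64_t tick_skew_ns(uint64_t period_ns, unsigned int nr_cpus,
			     unsigned int cpu)
{
	if (nr_cpus == 0)
		return 0;
	return (period_ns / 2 / nr_cpus) * cpu;
}
```

With a 1 ms tick and four CPUs this yields offsets of 0, 125, 250 and 375 microseconds, so no two CPUs contend for the same lock at the same moment because of the tick alone.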
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Link: http://lkml.kernel.org/r/1336472458.21924.78.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull KVM changes from Avi Kivity:
"Changes include additional instruction emulation, page-crossing MMIO,
faster dirty logging, preventing the watchdog from killing a stopped
guest, module autoload, a new MSI ABI, and some minor optimizations
and fixes. Outside x86 we have a small s390 and a very large ppc
update.
Regarding the new (for kvm) rebaseless workflow, some of the patches
that were merged before we switch trees had to be rebased, while
others are true pulls. In either case the signoffs should be correct
now."
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h.
I suspect the kvm_para.h resolution ends up doing the "do I have cpuid"
check effectively twice (it was done differently in two different
commits), but better safe than sorry ;)
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits)
KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block
KVM: s390: onereg for timer related registers
KVM: s390: epoch difference and TOD programmable field
KVM: s390: KVM_GET/SET_ONEREG for s390
KVM: s390: add capability indicating COW support
KVM: Fix mmu_reload() clash with nested vmx event injection
KVM: MMU: Don't use RCU for lockless shadow walking
KVM: VMX: Optimize %ds, %es reload
KVM: VMX: Fix %ds/%es clobber
KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
KVM: VMX: unlike vmcs on fail path
KVM: PPC: Emulator: clean up SPR reads and writes
KVM: PPC: Emulator: clean up instruction parsing
kvm/powerpc: Add new ioctl to retreive server MMU infos
kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM
KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler
KVM: PPC: Book3S: Enable IRQs during exit handling
KVM: PPC: Fix PR KVM on POWER7 bare metal
KVM: PPC: Fix stbux emulation
KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields
...
Minor changes and fixups for irqdomain infrastructure. Most
important change adds ability to remove registered irqdomain.
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull irqdomain changes from Grant Likely:
"Minor changes and fixups for irqdomain infrastructure. The most
important change adds the ability to remove a registered irqdomain."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
irqdomain: Document size parameter of irq_domain_add_linear()
irqdomain: trivial pr_fmt conversion.
irqdomain: Kill off duplicate definitions.
irqdomain: Make irq_domain_simple_map() static.
irqdomain: Export remaining public API symbols.
irqdomain: Support removal of IRQ domains.
All invocations of chip->irq_set_affinity() are doing the same return
value checks. Let them all use a common function.
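The shape of such a common helper can be sketched in user space as follows. Everything here is a hypothetical model: the `MODEL_SET_MASK_*` codes, `struct affinity_irq_model` and the sample chip callbacks are illustrative names, and the two-success-code convention is an assumption about what the shared checks look like.

```c
/* Return codes a chip callback may use in this sketch (hypothetical). */
enum { MODEL_SET_MASK_OK = 0, MODEL_SET_MASK_OK_NOCOPY = 1 };

struct affinity_irq_model {
	unsigned long affinity;	/* currently recorded CPU mask */
	int thread_updates;	/* times the irq thread was asked to follow */
	int (*set_affinity)(struct affinity_irq_model *irq, unsigned long mask);
};

/* One place that calls the chip and interprets its return value. */
static int model_do_set_affinity(struct affinity_irq_model *irq,
				 unsigned long mask)
{
	int ret = irq->set_affinity(irq, mask);

	switch (ret) {
	case MODEL_SET_MASK_OK:
		irq->affinity = mask;	/* core records the mask for the chip */
		/* fall through */
	case MODEL_SET_MASK_OK_NOCOPY:
		irq->thread_updates++;
		ret = 0;
		break;
	default:
		break;			/* error: hand it back unchanged */
	}
	return ret;
}

/* Sample chip callbacks exercising each path. */
static int chip_ok(struct affinity_irq_model *irq, unsigned long mask)
{
	(void)irq;
	(void)mask;
	return MODEL_SET_MASK_OK;
}

static int chip_nocopy(struct affinity_irq_model *irq, unsigned long mask)
{
	irq->affinity = mask & 0x1;	/* chip stored its own restricted mask */
	return MODEL_SET_MASK_OK_NOCOPY;
}

static int chip_fail(struct affinity_irq_model *irq, unsigned long mask)
{
	(void)irq;
	(void)mask;
	return -22;			/* stands in for -EINVAL */
}
```

Callers then only check for a zero or non-zero result instead of repeating the switch at every call site.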
[ tglx: removed the silly likely while at it ]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Keping Chen <chenkeping@huawei.com>
Link: http://lkml.kernel.org/r/1333120296-13563-3-git-send-email-jiang.liu@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timer updates from Thomas Gleixner.
Various trivial conflict fixups in arch Kconfig due to addition of
unrelated entries nearby. And one slightly more subtle one for sparc32
(new user of GENERIC_CLOCKEVENTS), fixed up as per Thomas.
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
timekeeping: Fix a few minor newline issues.
time: remove obsolete declaration
ntp: Fix a stale comment and a few stray newlines.
ntp: Correct TAI offset during leap second
timers: Fixup the Kconfig consolidation fallout
x86: Use generic time config
unicore32: Use generic time config
um: Use generic time config
tile: Use generic time config
sparc: Use: generic time config
sh: Use generic time config
score: Use generic time config
s390: Use generic time config
openrisc: Use generic time config
powerpc: Use generic time config
mn10300: Use generic time config
mips: Use generic time config
microblaze: Use generic time config
m68k: Use generic time config
m32r: Use generic time config
...
Every interrupt which is an active wakeup source needs the ability to
abort suspend if there is a pending irq. Right now only edge and level
irqs can do that.
         |
    +---------+
    |  INTC   |
    +---------+
         | GPIO_IRQ
   +------------+
   |  gpio-exp  |
   +------------+
     |        |
 GPIO0_IRQ  GPIO1_IRQ
In the above diagram, the gpio expander has irq number GPIO_IRQ and is
connected to two sub GPIO pins, GPIO0 and GPIO1.
During suspend, we set IRQF_NO_SUSPEND for GPIO_IRQ so that the gpio
expander driver can handle the sub irqs GPIO0_IRQ and GPIO1_IRQ. These
two irqs themselves can further be handled by simple or nested
irq in some drivers (typically gpio and mfd drivers). If they are used
as wakeup sources during suspend, we want them to be able to abort
suspend too.
Setting IRQS_PENDING flag in handle_nested_irq() and handle_simple_irq()
when the irq is disabled allows check_wakeup_irqs() to identify such
irqs as source for aborting suspend.
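The interaction can be modeled in a few lines of user-space C. This is only a sketch: `struct wake_irq_model` and both `model_*` functions are hypothetical and merely mirror the roles of the flow handlers and check_wakeup_irqs() described above.

```c
#include <stdbool.h>

/* Hypothetical per-irq state for the sketch. */
struct wake_irq_model {
	bool disabled;	/* irq was disabled for suspend */
	bool wakeup;	/* irq is armed as a wakeup source */
	bool pending;	/* models the IRQS_PENDING flag */
	int handled;	/* number of times the action actually ran */
};

/* Flow handler: if the irq is disabled, remember it instead of handling it. */
static void model_handle_simple_irq(struct wake_irq_model *irq)
{
	if (irq->disabled) {
		irq->pending = true;
		return;
	}
	irq->handled++;
}

/* Returns true if suspend should be aborted. */
static bool model_check_wakeup_irqs(const struct wake_irq_model *irqs, int n)
{
	for (int i = 0; i < n; i++)
		if (irqs[i].wakeup && irqs[i].pending)
			return true;
	return false;
}
```

Without the pending mark in the disabled path, the check has nothing to find and a wakeup interrupt arriving during suspend would be lost.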
Signed-off-by: Ning Jiang <ning.n.jiang@gmail.com>
Cc: rjw@sisk.pl
Link: http://lkml.kernel.org/r/CAH3Oq6T905%2B3fkF43NAMMFvJvq7dsk_so6T2vQ8ZJrA5xiU3YA@mail.gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull more networking updates from David Miller:
"Ok, everything from here on out will be bug fixes."
1) One final sync of wireless and bluetooth stuff from John Linville.
These changes have all been in his tree for more than a week, and
therefore have had the necessary -next exposure. John was just away
on a trip and didn't have a chance to send the pull request until a
day or two ago.
2) Put back some defines in user exposed header file areas that were
removed during the tokenring purge. From Stephen Hemminger and Paul
Gortmaker.
3) A bug fix for UDP hash table allocation got lost in the pile due to
one of those "you got it.. no I've got it.." situations. :-)
From Tim Bird.
4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
try to coalesce overlapping frags and crash. Fix from Eric Dumazet.
5) RCU routing table lookups can race with free_fib_info(), causing
crashes when we deref the device pointers in the route. Fix by
releasing the net device in the RCU callback. From Yanmin Zhang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
tcp: take care of overlaps in tcp_try_coalesce()
ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
mm: add a low limit to alloc_large_system_hash
ipx: restore token ring define to include/linux/ipx.h
if: restore token ring ARP type to header
xen: do not disable netfront in dom0
phy/micrel: Fix ID of KSZ9021
mISDN: Add X-Tensions USB ISDN TA XC-525
gianfar:don't add FCB length to hard_header_len
Bluetooth: Report proper error number in disconnection
Bluetooth: Create flags for bt_sk()
Bluetooth: report the right security level in getsockopt
Bluetooth: Lock the L2CAP channel when sending
Bluetooth: Restore locking semantics when looking up L2CAP channels
Bluetooth: Fix a redundant and problematic incoming MTU check
Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
Bluetooth: Fix EIR data generation for mgmt_device_found
Bluetooth: Fix Inquiry with RSSI event mask
Bluetooth: improve readability of l2cap_seq_list code
Bluetooth: Fix skb length calculation
...
Pull user-space probe instrumentation from Ingo Molnar:
"The uprobes code originates from SystemTap and has been used for years
in Fedora and RHEL kernels. This version is much rewritten, reviews
from PeterZ, Oleg and myself shaped the end result.
This tree includes uprobes support in 'perf probe' - but SystemTap
(and other tools) can take advantage of user probe points as well.
Sample usage of uprobes via perf, for example to profile malloc()
calls without modifying user-space binaries:
First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.
If you don't know which function you want to probe you can pick one
from 'perf top' or can get a list of all functions that can be probed
within libc (binaries can be specified as well):
$ perf probe -F -x /lib/libc.so.6
To probe libc's malloc():
$ perf probe -x /lib64/libc.so.6 malloc
Added new event:
probe_libc:malloc (on 0x7eac0)
You can now use it in all perf tools, such as:
perf record -e probe_libc:malloc -aR sleep 1
Make use of it to create a call graph (as the flat profile is going to
look very boring):
$ perf record -e probe_libc:malloc -gR make
[ perf record: Woken up 173 times to write data ]
[ perf record: Captured and wrote 44.190 MB perf.data (~1930712
$ perf report | less
32.03% git libc-2.15.so [.] malloc
|
--- malloc
29.49% cc1 libc-2.15.so [.] malloc
|
--- malloc
|
|--0.95%-- 0x208eb1000000000
|
|--0.63%-- htab_traverse_noresize
11.04% as libc-2.15.so [.] malloc
|
--- malloc
|
7.15% ld libc-2.15.so [.] malloc
|
--- malloc
|
5.07% sh libc-2.15.so [.] malloc
|
--- malloc
|
4.99% python-config libc-2.15.so [.] malloc
|
--- malloc
|
4.54% make libc-2.15.so [.] malloc
|
--- malloc
|
|--7.34%-- glob
| |
| |--93.18%-- 0x41588f
| |
| --6.82%-- glob
| 0x41588f
...
Or:
$ perf report -g flat | less
# Overhead Command Shared Object Symbol
# ........ ............. ............. ..........
#
32.03% git libc-2.15.so [.] malloc
27.19%
malloc
29.49% cc1 libc-2.15.so [.] malloc
24.77%
malloc
11.04% as libc-2.15.so [.] malloc
11.02%
malloc
7.15% ld libc-2.15.so [.] malloc
6.57%
malloc
...
The core uprobes design is fairly straightforward: uprobes probe
points register themselves at (inode:offset) addresses of
libraries/binaries, after which all existing (or new) vmas that map
that address will have a software breakpoint injected at that address.
vmas are COW-ed to preserve original content. The probe points are
kept in an rbtree.
If user-space executes the probed inode:offset instruction address
then an event is generated which can be recovered from the regular
perf event channels and mmap-ed ring-buffer.
Multiple probes at the same address are supported, they create a
dynamic callback list of event consumers.
The basic model is further complicated by the XOL speedup: the
original instruction that is probed is copied (in an architecture
specific fashion) and executed out of line when the probe triggers.
The XOL area is a single vma per process, with a fixed number of
entries (which limits probe execution parallelism).
The API: uprobes are installed/removed via
/sys/kernel/debug/tracing/uprobe_events, the API is integrated to
align with the kprobes interface as much as possible, but is separate
to it.
Injecting a probe point is a privileged operation, which can be relaxed
by setting perf_paranoid to -1.
You can use multiple probes as well and mix them with kprobes and
regular PMU events or tracepoints, when instrumenting a task."
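The (inode:offset) keying mentioned in the pull message can be illustrated with a small sketch. It uses a sorted array and bsearch() as a stand-in for the rbtree; `struct probe_key`, `probe_key_cmp` and `probe_find` are hypothetical names, and the inode is represented by a plain number.

```c
#include <stdint.h>
#include <stdlib.h>

/* A probe point is identified by which file it is in and where. */
struct probe_key {
	uint64_t inode;
	uint64_t offset;
};

/* Total order: compare by inode first, then by offset within the inode. */
static int probe_key_cmp(const void *a, const void *b)
{
	const struct probe_key *x = a;
	const struct probe_key *y = b;

	if (x->inode != y->inode)
		return x->inode < y->inode ? -1 : 1;
	if (x->offset != y->offset)
		return x->offset < y->offset ? -1 : 1;
	return 0;
}

/* Look up a key in an array sorted with probe_key_cmp(); NULL if absent. */
static const struct probe_key *probe_find(const struct probe_key *tab,
					  size_t n, struct probe_key key)
{
	return bsearch(&key, tab, n, sizeof(*tab), probe_key_cmp);
}
```

Any ordered container works with the same comparator; the point is that a breakpoint hit can be resolved to its probe by file identity and file offset, independent of where the file happens to be mapped.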
Fix up trivial conflicts in mm/memory.c due to previous cleanup of
unmap_single_vma().
* 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
perf probe: Detect probe target when m/x options are absent
perf probe: Provide perf interface for uprobes
tracing: Fix kconfig warning due to a typo
tracing: Provide trace events interface for uprobes
tracing: Extract out common code for kprobes/uprobes trace events
tracing: Modify is_delete, is_return from int to bool
uprobes/core: Decrement uprobe count before the pages are unmapped
uprobes/core: Make background page replacement logic account for rss_stat counters
uprobes/core: Optimize probe hits with the help of a counter
uprobes/core: Allocate XOL slots for uprobes use
uprobes/core: Handle breakpoint and singlestep exceptions
uprobes/core: Rename bkpt to swbp
uprobes/core: Make order of function parameters consistent across functions
uprobes/core: Make macro names consistent
uprobes: Update copyright notices
uprobes/core: Move insn to arch specific structure
uprobes/core: Remove uprobe_opcode_sz
uprobes/core: Make instruction tables volatile
uprobes: Move to kernel/events/
uprobes/core: Clean up, refactor and improve the code
...
Pull media updates from Mauro Carvalho Chehab:
- some V4L2 API updates needed by embedded devices
- DVB API extensions for ATSC-MH delivery system, used in US for mobile
TV
- new tuners for fc0011/0012/0013 and tua9001
- a new dvb driver for af9033/9035
- a new ATSC-MH frontend (lg2160)
- new remote controller keymaps
- removal of a few legacy webcam drivers that were replaced by gspca
several kernel versions ago
- a new driver for Exynos 4/5 webcams (s5pp fimc-lite)
- a new webcam sensor driver (smiapp)
- a new video input driver for embedded (sta2x1xx)
- several improvements, fixes, cleanups, etc inside the drivers.
Manually fix up conflicts due to err() -> dev_err() conversion in
drivers/staging/media/easycap/easycap_main.c
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (484 commits)
[media] saa7134-cards: Remove a PCI entry added by mistake
[media] radio-sf16fmi: add support for SF16-FMD
[media] rc-loopback: remove duplicate line
[media] patch for Asus My Cinema PS3-100 (1043:48cd)
[media] au0828: Move the Kconfig knob under V4L_USB_DRIVERS
[media] em28xx: simple comment fix
[media] [resend] radio-sf16fmr2: add PnP support for SF16-FMD2
[media] smiapp: Use v4l2_ctrl_new_int_menu() instead of v4l2_ctrl_new_custom()
[media] smiapp: Add support for 8-bit uncompressed formats
[media] smiapp: Allow generic quirk registers
[media] smiapp: Use non-binning limits if the binning limit is zero
[media] smiapp: Initialise rval in smiapp_read_nvm()
[media] smiapp: Round minimum pre_pll up rather than down in ip_clk_freq check
[media] smiapp: Use 8-bit reads only before identifying the sensor
[media] smiapp: Quirk for sensors that only do 8-bit reads
[media] smiapp: Pass struct sensor to register writing commands instead of i2c_client
[media] smiapp: Allow using external clock from the clock framework
[media] zl10353: change .read_snr() to report SNR as a 0.1 dB
[media] media: add support to gspca/pac7302.c for 093a:2627 (Genius FaceCam 300)
[media] m88rs2000 - only flip bit 2 on reg 0x70 on 16th try
...
Pull an ftrace ring-buffer fix from Steve Rostedt:
* fix kernel crash when changing the size of the ring-buffer on
boxes where possible_cpus != online_cpus.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.
In some low-memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocate a bigger memory area.
As we cannot easily free the old hash table, we leak it and kmemleak
can issue a warning.
This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.
We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.
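The effect of a low limit can be sketched as a simple clamp on the computed entry count. `clamp_hash_entries` is a hypothetical helper, and treating a high limit of 0 as "no upper bound" is an assumption of this sketch.

```c
/*
 * Clamp an automatically sized entry count to [low_limit, high_limit].
 * A high_limit of 0 means "no upper bound" here. If the limits conflict,
 * the low limit wins, since the caller cannot operate below it.
 */
static unsigned long clamp_hash_entries(unsigned long numentries,
					unsigned long low_limit,
					unsigned long high_limit)
{
	if (high_limit && numentries > high_limit)
		numentries = high_limit;
	if (numentries < low_limit)
		numentries = low_limit;
	return numentries;
}
```

Because the minimum is applied before the table is allocated, the caller never receives a table that is too small and never has to allocate a second one.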
Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kill the no longer used task_struct->replacement_session_keyring, update
copy_creds() and exit_creds().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Smith <dsmith@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
exit_irq_thread() and task->irq_thread are needed to handle the unexpected
(and unlikely) exit of irq-thread.
We can use task_work instead and make this all private to
kernel/irq/manage.c, cleanup plus micro-optimization.
1. rename exit_irq_thread() to irq_thread_dtor(), make it
static, and move it up before irq_thread().
2. change irq_thread() to do task_work_add(irq_thread_dtor)
at the start and task_work_cancel() before return.
tracehook_notify_resume() can never play with kthreads,
only do_exit()->exit_task_work() can call the callback
and this is what we want.
3. remove task_struct->irq_thread and the special hook
in do_exit().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Smith <dsmith@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Provide a simple mechanism that allows running code in the (nonatomic)
context of an arbitrary task.
The caller does task_work_add(task, task_work) and this task executes
task_work->func() either from do_notify_resume() or from do_exit(). The
callback can rely on PF_EXITING to detect the latter case.
"struct task_work" can be embedded in another struct, still it has "void
*data" to handle the most common/simple case.
This allows us to kill the ->replacement_session_keyring hack, and
potentially this can have more users.
Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
tracehook_notify_resume() and do_exit(). But at the same time we can
remove the "replacement_session_keyring != NULL" checks from
arch/*/signal.c and exit_creds().
Note: task_work_add/task_work_run abuses ->pi_lock. This is only because
this lock is already used by lookup_pi_state() to synchronize with
do_exit() setting PF_EXITING. Fortunately the scope of this lock in
task_work.c is really tiny, and the code is unlikely anyway.
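A minimal user-space model of the add/run pattern follows. It is a sketch only: it has no locking (the real code synchronizes via ->pi_lock as noted above), it uses a plain singly linked list, and every `model_*` name and structure is hypothetical.

```c
#include <stddef.h>

struct task_work_model;
typedef void (*task_work_fn)(struct task_work_model *);

/* One queued callback; ->data covers the simple case of a single pointer. */
struct task_work_model {
	struct task_work_model *next;
	task_work_fn func;
	void *data;
};

/* The part of a task that holds its pending callbacks. */
struct task_model {
	struct task_work_model *works;
};

/* Queue a callback on the task; the newest entry goes to the front. */
static void model_task_work_add(struct task_model *t,
				struct task_work_model *w)
{
	w->next = t->works;
	t->works = w;
}

/*
 * Run everything queued so far and return how many callbacks ran.
 * The list is detached first, so a callback may queue new work safely.
 */
static int model_task_work_run(struct task_model *t)
{
	struct task_work_model *w = t->works;
	int ran = 0;

	t->works = NULL;
	while (w) {
		struct task_work_model *next = w->next;

		w->func(w);
		w = next;
		ran++;
	}
	return ran;
}

/* Example callback: increment the int that ->data points to. */
static void model_count_cb(struct task_work_model *w)
{
	(*(int *)w->data)++;
}
```

In this model the run function would be called from the points named above, i.e. on return to user mode and on exit.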
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Smith <dsmith@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull first series of signal handling cleanups from Al Viro:
"This is just the first part of the queue (about a half of it);
assorted fixes all over the place in signal handling.
This one ends with all sigsuspend() implementations switched to
generic one (->saved_sigmask-based).
With this, a bunch of assorted old buglets are fixed and most of the
missing bits of NOTIFY_RESUME hookup are in place. Two more fixes sit
in arm and um trees respectively, and there's a couple of broken ones
that need obvious fixes - parisc and avr32 check TIF_NOTIFY_RESUME
only on one of two codepaths; fixes for that will happen in the next
series"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (55 commits)
unicore32: if there's no handler we need to restore sigmask, syscall or no syscall
xtensa: add handling of TIF_NOTIFY_RESUME
microblaze: drop 'oldset' argument of do_notify_resume()
microblaze: handle TIF_NOTIFY_RESUME
score: add handling of NOTIFY_RESUME to do_notify_resume()
m68k: add TIF_NOTIFY_RESUME and handle it.
sparc: kill ancient comment in sparc_sigaction()
h8300: missing checks of __get_user()/__put_user() return values
frv: missing checks of __get_user()/__put_user() return values
cris: missing checks of __get_user()/__put_user() return values
powerpc: missing checks of __get_user()/__put_user() return values
sh: missing checks of __get_user()/__put_user() return values
sparc: missing checks of __get_user()/__put_user() return values
avr32: struct old_sigaction is never used
m32r: struct old_sigaction is never used
xtensa: xtensa_sigaction doesn't exist
alpha: tidy signal delivery up
score: don't open-code force_sigsegv()
cris: don't open-code force_sigsegv()
blackfin: don't open-code force_sigsegv()
...
Pull user namespace enhancements from Eric Biederman:
"This is a course correction for the user namespace, so that we can
reach an inexpensive, maintainable, and reasonably complete
implementation.
Highlights:
- Config guards make it impossible to enable the user namespace and
code that has not been converted to be user namespace safe.
- Use of the new kuid_t type ensures that if you somehow get past the
config guards the kernel will encounter type errors if you enable
user namespaces and attempt to compile in code whose permission
checks have not been updated to be user namespace safe.
- All uids from child user namespaces are mapped into the initial
user namespace before they are processed, removing the need to add
an additional check to see if the user namespace of the compared
uids remains the same.
- With the user namespaces compiled out the performance is as good or
better than it is today.
- For most operations absolutely nothing changes, performance-wise or
operationally, with the user namespace enabled.
- The worst case performance I could come up with was timing 1
billion cache cold stat operations with the user namespace code
enabled. This went from 156s to 164s on my laptop (or 156ns to
164ns per stat operation).
- (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
Most uid/gid setting system calls treat these values specially
anyway so attempting to use -1 as a uid would likely cause
entertaining failures in userspace.
- If setuid is called with a uid that can not be mapped setuid fails.
I have looked at sendmail, login, ssh and every other program I
could think of that would call setuid and they all check for and
handle the case where setuid fails.
- If stat or a similar system call is called from a context in which
we can not map a uid we lie and return overflowuid. The LFS
experience suggests not lying and returning an error code might be
better, but the historical precedent with uids is different and I
can not think of anything that would break by lying about a uid we
can't map.
- Capabilities are localized to the current user namespace making it
safe to give the initial user in a user namespace all capabilities.
My git tree covers all of the modifications needed to convert the core
kernel and enough changes to make a system bootable to runlevel 1."
Fix up trivial conflicts due to nearby independent changes in fs/stat.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
userns: Silence silly gcc warning.
cred: use correct cred accessor with regards to rcu read lock
userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
userns: Convert cgroup permission checks to use uid_eq
userns: Convert tmpfs to use kuid and kgid where appropriate
userns: Convert sysfs to use kgid/kuid where appropriate
userns: Convert sysctl permission checks to use kuid and kgids.
userns: Convert proc to use kuid/kgid where appropriate
userns: Convert ext4 to user kuid/kgid where appropriate
userns: Convert ext3 to use kuid/kgid where appropriate
userns: Convert ext2 to use kuid/kgid where appropriate.
userns: Convert devpts to use kuid/kgid where appropriate
userns: Convert binary formats to use kuid/kgid where appropriate
userns: Add negative depends on entries to avoid building code that is userns unsafe
userns: signal remove unnecessary map_cred_ns
userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
userns: Convert stat to return values mapped from kuids and kgids
userns: Convert user specfied uids and gids in chown into kuids and kgid
userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
...
Merge tag 'module-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
Pull module patches from Rusty Russell, who really sells them:
"Three trivial patches of no real utility. Modules are boring."
But to make things slightly more exciting, he adds:
"Fortunately David Howells is looking to change this, with his module
signing patchset. But that's for next merge window...
Cheers,
Rusty."
* tag 'module-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
Guard check in module loader against integer overflow
modpost: use proper kernel style for autogenerated files
modpost: Stop grab_file() from leaking filedescriptors if fstat() fails
Merge tag 'pm-for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
- Implementation of opportunistic suspend (autosleep) and user space
interface for manipulating wakeup sources.
- Hibernate updates from Bojan Smojver and Minho Ban.
- Updates of the runtime PM core and generic PM domains framework
related to PM QoS.
- Assorted fixes.
* tag 'pm-for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
epoll: Fix user space breakage related to EPOLLWAKEUP
PM / Domains: Make it possible to add devices to inactive domains
PM / Hibernate: Use get_gendisk to verify partition if resume_file is integer format
PM / Domains: Fix computation of maximum domain off time
PM / Domains: Fix link checking when add subdomain
PM / Sleep: User space wakeup sources garbage collector Kconfig option
PM / Sleep: Make the limit of user space wakeup sources configurable
PM / Documentation: suspend-and-cpuhotplug.txt: Fix typo
PM / Domains: Cache device stop and domain power off governor results, v3
PM / Domains: Make device removal more straightforward
PM / Sleep: Fix a mistake in a conditional in autosleep_store()
epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready
PM / QoS: Create device constraints objects on notifier registration
PM / Runtime: Remove device fields related to suspend time, v2
PM / Domains: Rework default domain power off governor function, v2
PM / Domains: Rework default device stop governor function, v2
PM / Sleep: Add user space interface for manipulating wakeup sources, v3
PM / Sleep: Add "prevent autosleep time" statistics to wakeup sources
PM / Sleep: Implement opportunistic sleep, v2
PM / Sleep: Add wakeup_source_activate and wakeup_source_deactivate tracepoints
...
On some machines the number of possible CPUs is not the same as the
number of CPUs that are on the machine. Ftrace uses possible_cpus to
update the tracing structures but the ring buffer only allocates
per cpu buffers for online CPUs when they come up.
When the wakeup tracer was enabled in such a case, the ftrace code
enabled all possible cpu buffers, but the code in ring_buffer_resize()
did not check to see if the buffer in question was allocated. Since
boot up CPUs did not match possible CPUs it caused the following
crash:
BUG: unable to handle kernel NULL pointer dereference at 00000020
IP: [<c1097851>] ring_buffer_resize+0x16a/0x28d
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: [last unloaded: scsi_wait_scan]
Pid: 1387, comm: bash Not tainted 3.4.0-test+ #13 /DG965MQ
EIP: 0060:[<c1097851>] EFLAGS: 00010217 CPU: 0
EIP is at ring_buffer_resize+0x16a/0x28d
EAX: f5a14340 EBX: f6026b80 ECX: 00000ff4 EDX: 00000ff3
ESI: 00000000 EDI: 00000002 EBP: f4275ecc ESP: f4275eb0
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 80050033 CR2: 00000020 CR3: 34396000 CR4: 000007d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process bash (pid: 1387, ti=f4274000 task=f4380cb0 task.ti=f4274000)
Stack:
c109cf9a f6026b98 00000162 00160f68 00000006 00160f68 00000002 f4275ef0
c109d013 f4275ee8 c123b72a c1c0bf00 c1cc81dc 00000005 f4275f98 00000007
f4275f70 c109d0c7 7700000e 75656b61 00000070 f5e90900 f5c4e198 00000301
Call Trace:
[<c109cf9a>] ? tracing_set_tracer+0x115/0x1e9
[<c109d013>] tracing_set_tracer+0x18e/0x1e9
[<c123b72a>] ? _copy_from_user+0x30/0x46
[<c109d0c7>] tracing_set_trace_write+0x59/0x7f
[<c10ec01e>] ? fput+0x18/0x1c6
[<c11f8732>] ? security_file_permission+0x27/0x2b
[<c10eaacd>] ? rw_verify_area+0xcf/0xf2
[<c10ec01e>] ? fput+0x18/0x1c6
[<c109d06e>] ? tracing_set_tracer+0x1e9/0x1e9
[<c10ead77>] vfs_write+0x8b/0xe3
[<c10ebead>] ? fget_light+0x30/0x81
[<c10eaf54>] sys_write+0x42/0x63
[<c1834fbf>] sysenter_do_call+0x12/0x28
This happens with the latency tracer as the ftrace code updates the
saved max buffer via its cpumask and not with a global setting.
Adding a check in ring_buffer_resize() to make sure the buffer being resized
exists fixes the problem.
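The shape of such a check can be sketched in userspace C. This is only
an illustration under stated assumptions: the struct names, the fixed
NR_POSSIBLE_CPUS value and the resize_all() helper are hypothetical and
are not taken from the patch. Slots for CPUs that never came online stay
NULL and are skipped rather than dereferenced.

```c
#include <assert.h>
#include <stddef.h>

#define NR_POSSIBLE_CPUS 4  /* hypothetical number of possible CPUs */

struct sketch_cpu_buffer {
    size_t nr_pages;
};

struct sketch_ring_buffer {
    /* One slot per possible CPU; NULL until that CPU comes online. */
    struct sketch_cpu_buffer *buffers[NR_POSSIBLE_CPUS];
};

/* Resize every allocated per-cpu buffer; returns how many were resized. */
static int resize_all(struct sketch_ring_buffer *rb, size_t nr_pages)
{
    int resized = 0;

    assert(rb != NULL);
    for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++) {
        struct sketch_cpu_buffer *cb = rb->buffers[cpu];

        if (!cb)
            continue;   /* possible but never allocated: nothing to do */
        cb->nr_pages = nr_pages;
        resized++;
    }
    return resized;
}
```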
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
- Leftover AMD PMU driver fix from the end of the v3.4
stabilization cycle.
- Late tools/perf/ changes that missed the first round:
* endianness fixes
* event parsing improvements
* libtraceevent fixes factored out from trace-cmd
* perl scripting engine fixes related to libtraceevent
* testcase improvements
* perf inject / pipe mode fixes
* plus a kernel side fix
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Update event scheduling constraints for AMD family 15h models
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "sched, perf: Use a single callback into the scheduler"
perf evlist: Show event attribute details
perf tools: Bump default sample freq to 4 kHz
perf buildid-list: Work better with pipe mode
perf tools: Fix piped mode read code
perf inject: Fix broken perf inject -b
perf tools: rename HEADER_TRACE_INFO to HEADER_TRACING_DATA
perf tools: Add union u64_swap type for swapping u64 data
perf tools: Carry perf_event_attr bitfield throught different endians
perf record: Fix documentation for branch stack sampling
perf target: Add cpu flag to sample_type if target has cpu
perf tools: Always try to build libtraceevent
perf tools: Rename libparsevent to libtraceevent in Makefile
perf script: Rename struct event to struct event_format in perl engine
perf script: Explicitly handle known default print arg type
perf tools: Add hardcoded name term for pmu events
perf tools: Separate 'mem:' event scanner bits
perf tools: Use allocated list for each parsed event
perf tools: Add support for displaying event parser debug info
perf test: Move parse event automated tests to separated object
Pull fpu state cleanups from Ingo Molnar:
"This tree streamlines further aspects of FPU handling by eliminating
the prepare_to_copy() complication and moving that logic to
arch_dup_task_struct().
It also fixes the FPU dumps in threaded core dumps, removes an old
(and now invalid) assumption plus micro-optimizes the exit path by
avoiding an FPU save for dead tasks."
Fixed up trivial add-add conflict in arch/sh/kernel/process.c that came
in because we now do the FPU handling in arch_dup_task_struct() rather
than the legacy (and now gone) prepare_to_copy().
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, fpu: drop the fpu state during thread exit
x86, xsave: remove thread_has_fpu() bug check in __sanitize_i387_state()
coredump: ensure the fpu state is flushed for proper multi-threaded core dump
fork: move the real prepare_to_copy() users to arch_dup_task_struct()
Pull exception table generation updates from Ingo Molnar:
"The biggest change here is to allow the build-time sorting of the
exception table, to speed up booting. This is achieved by the
architecture enabling BUILDTIME_EXTABLE_SORT. This option is enabled
for x86 and MIPS currently.
On x86 a number of fixes and changes were needed to allow build-time
sorting of the exception table, in particular a relocation invariant
exception table format was needed. This required the abstracting out
of exception table protocol and the removal of 20 years of accumulated
assumptions about the x86 exception table format.
While at it, this tree also cleans up various other aspects of
exception handling, such as early(er) exception handling for
rdmsr_safe() et al.
All in all, as a result of these changes the x86 exception code is
now pretty nice and modern. As an added bonus any regressions in this
code will be early and violent crashes, so if you see any of those,
you'll know whom to blame!"
Fix up trivial conflicts in arch/{mips,x86}/Kconfig files due to nearby
modifications of other core architecture options.
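To make the idea of a relocation invariant entry format more concrete,
here is a userspace sketch. The struct and helper names are hypothetical
and the layout is an assumption for illustration, not a copy of the x86
code. Each field holds a 32-bit offset from its own address instead of
an absolute address, so an image containing both the table and the code
it refers to can be moved as a unit without rewriting the entries.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical relative entry: offsets are measured from each field. */
struct rel_extable_entry {
    int32_t insn;   /* offset from &insn to the faulting instruction */
    int32_t fixup;  /* offset from &fixup to the fixup code */
};

/* Offset from a field to its target, checked to fit in 32 bits. */
static int32_t rel_offset(const void *target, const void *field)
{
    intptr_t delta = (intptr_t)target - (intptr_t)field;

    assert(delta >= INT32_MIN && delta <= INT32_MAX);
    return (int32_t)delta;
}

static void rel_entry_set(struct rel_extable_entry *e,
                          const void *insn, const void *fixup)
{
    e->insn = rel_offset(insn, &e->insn);
    e->fixup = rel_offset(fixup, &e->fixup);
}

/* Recover absolute addresses from wherever the entry lives right now. */
static uintptr_t rel_entry_insn(const struct rel_extable_entry *e)
{
    return (uintptr_t)((intptr_t)&e->insn + e->insn);
}

static uintptr_t rel_entry_fixup(const struct rel_extable_entry *e)
{
    return (uintptr_t)((intptr_t)&e->fixup + e->fixup);
}
```

Copying a structure that holds both the code bytes and the entry to a
new address leaves the decoded addresses pointing into the copy, which
is the property an absolute-address table lacks.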
* 'x86-extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
Revert "x86, extable: Disable presorted exception table for now"
scripts/sortextable: Handle relative entries, and other cleanups
x86, extable: Switch to relative exception table entries
x86, extable: Disable presorted exception table for now
x86, extable: Add _ASM_EXTABLE_EX() macro
x86, extable: Remove open-coded exception table entries in arch/x86/ia32/ia32entry.S
x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/xsave.h
x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/kvm_host.h
x86, extable: Remove the now-unused __ASM_EX_SEC macros
x86, extable: Remove open-coded exception table entries in arch/x86/xen/xen-asm_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/um/checksum_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/usercopy_32.c
x86, extable: Remove open-coded exception table entries in arch/x86/lib/putuser.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/getuser.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/csum-copy_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_nocache_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/checksum_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/kernel/test_rodata.c
x86, extable: Remove open-coded exception table entries in arch/x86/kernel/entry_64.S
...
Pull UML updates from Richard Weinberger:
"Most changes are bug fixes and cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: missing checks of __put_user()/__get_user() return values
um: stub_rt_sigsuspend isn't needed these days anymore
um/x86: merge (and trim) 32- and 64-bit variants of ptrace.h
irq: Remove irq_chip->release()
um: Remove CONFIG_IRQ_RELEASE_METHOD
um: Remove usage of irq_chip->release()
um: Implement um_free_irq()
um: Fix __swp_type()
um: Implement a custom pte_same() function
um: Add BUG() to do_ops()'s error path
um: Remove unused variables
um: bury unused _TIF_RESTORE_SIGMASK
um: wrong sigmask saved in case of multiple sigframes
um: add TIF_NOTIFY_RESUME
um: ->restart_block.fn needs to be reset on sigreturn
This reverts commit cb04ff9ac4 ("sched, perf: Use a single
callback into the scheduler").
Before this change was introduced, the process switch worked
like this (wrt. perf event scheduling):
schedule (prev, next)
- schedule out all perf events for prev
- switch to next
- schedule in all perf events for current (next)
After the commit, the process switch looks like:
schedule (prev, next)
- schedule out all perf events for prev
- schedule in all perf events for (next)
- switch to next
The problem is that after we schedule perf events in, the pmu
is enabled and we can receive events even before we make the
switch to next, so "current" is still the prev process (event
SAMPLE data are filled based on the value of the "current"
process).
That's exactly what we see for the test__PERF_RECORD test. We receive
SAMPLES with the PID of the process that our tracee is scheduled
from.
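The difference between the two orderings can be modelled with a toy
simulation. Everything here is hypothetical (the cpu_state struct, the
helper names, the single "sample" taken at a fixed point); it only
mirrors the steps listed above and is not kernel code. A sample is
attributed to whatever "current" is at the moment it fires, and is
dropped while the pmu is disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cpu_state {
    int current_pid;       /* what "current" refers to right now */
    bool pmu_enabled;      /* samples are only recorded while true */
    int last_sample_pid;   /* PID stamped on the most recent sample */
};

/* A sample is stamped with "current", but only if the pmu is on. */
static void take_sample(struct cpu_state *cpu)
{
    assert(cpu != NULL);
    if (cpu->pmu_enabled)
        cpu->last_sample_pid = cpu->current_pid;
}

/* Two hooks: schedule out, switch, then schedule in. */
static int switch_two_hooks(struct cpu_state *cpu, int next_pid)
{
    cpu->pmu_enabled = false;       /* schedule out events for prev */
    take_sample(cpu);               /* falls in the gap: dropped */
    cpu->current_pid = next_pid;    /* switch to next */
    cpu->pmu_enabled = true;        /* schedule in events for next */
    take_sample(cpu);
    return cpu->last_sample_pid;
}

/* Single hook: events for next are scheduled in before the switch. */
static int switch_single_hook(struct cpu_state *cpu, int next_pid)
{
    int early_sample_pid;

    cpu->pmu_enabled = false;       /* schedule out events for prev */
    cpu->pmu_enabled = true;        /* schedule in events for next */
    take_sample(cpu);               /* "current" is still prev here */
    early_sample_pid = cpu->last_sample_pid;
    cpu->current_pid = next_pid;    /* switch to next */
    return early_sample_pid;
}
```

With prev at PID 100 and next at PID 200, the two-hook model reports a
sample for 200, while the single-hook model reports one for 100, which
matches the wrong-PID samples described above.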
Discussed with Peter Zijlstra:
> Bah!, yeah I guess reverting is the right thing for now. Sad
> though.
>
> So by having the two hooks we have a black-spot between them
> where we receive no events at all, this black-spot covers the
> hand-over of current and we thus don't receive the 'wrong'
> events.
>
> I rather liked we could do away with both that black-spot and
> clean up the code a little, but apparently people rely on it.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: acme@redhat.com
Cc: paulus@samba.org
Cc: cjashfor@linux.vnet.ibm.com
Cc: fweisbec@gmail.com
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The check:
if (len < hdr->e_shoff + hdr->e_shnum * sizeof(Elf_Shdr))
may not work if there's an overflow in the right-hand side of the condition.
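One common way to write such a bounds check without overflow is to
compare against the remaining space instead of computing the end
offset. The sketch below is a generic userspace illustration; the
table_fits() helper and its parameter names are hypothetical and it is
not claimed to be the exact form used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when `count` entries of `entry_size` bytes, starting at `offset`,
 * lie entirely within a file of `len` bytes. No addition or
 * multiplication is performed, so nothing can wrap around.
 */
static bool table_fits(uint64_t len, uint64_t offset,
                       uint64_t count, uint64_t entry_size)
{
    assert(entry_size > 0);

    if (offset > len)
        return false;                       /* table starts past the end */
    if (count > (len - offset) / entry_size)
        return false;                       /* too many entries for the rest */
    return true;
}
```

For example, with len = 1000 and an offset close to UINT64_MAX, the
naive sum wraps to a small number and passes the original check, while
table_fits() rejects it at the first comparison.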
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull trivial updates from Jiri Kosina:
"As usual, it's mostly typo fixes, redundant code elimination and some
documentation updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
edac, mips: don't change code that has been removed in edac/mips tree
xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
lib: Change mail address of Oskar Schirmer
net: Change mail address of Oskar Schirmer
arm/m68k: Change mail address of Sebastian Hess
i2c: Change mail address of Oskar Schirmer
net: Fix tcp_build_and_update_options comment in struct tcp_sock
atomic64_32.h: fix parameter naming mismatch
Kconfig: replace "--- help ---" with "---help---"
c2port: fix bogus Kconfig "default no"
edac: Fix spelling errors.
qla1280: Remove redundant NULL check before release_firmware() call
remoteproc: remove redundant NULL check before release_firmware()
qla2xxx: Remove redundant NULL check before release_firmware() call.
aic94xx: Get rid of redundant NULL check before release_firmware() call
tehuti: delete redundant NULL check before release_firmware()
qlogic: get rid of a redundant test for NULL before call to release_firmware()
bna: remove redundant NULL test before release_firmware()
tg3: remove redundant NULL test before release_firmware() call
typhoon: get rid of redundant conditional before all to release_firmware()
...
Pull scheduler changes from Ingo Molnar:
"The biggest change is the cleanup/simplification of the load-balancer:
instead of the current practice of architectures twiddling scheduler
internal data structures and providing the scheduler domains in
colorfully inconsistent ways, we now have generic scheduler code in
kernel/sched/core.c:sched_init_numa() that looks at the architecture's
node_distance() parameters and (while not fully trusting it) deduces a
NUMA topology from it.
This inevitably changes balancing behavior - hopefully for the better.
There are various smaller optimizations, cleanups and fixlets as well"
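As a rough illustration of how a topology might be derived from a
distance table, the sketch below collects the distinct distance values
in ascending order, on the assumption that each distinct value could
serve as one topology level. The helper name, the flat row-major table
and the sample distances are hypothetical; this is not the code in
sched_init_numa().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collect the distinct values of a row-major nr_nodes x nr_nodes
 * distance table into `levels`, sorted ascending. `levels` must have
 * room for nr_nodes * nr_nodes entries. Returns the number of levels.
 */
static size_t numa_distance_levels(const int *dist, size_t nr_nodes,
                                   int *levels)
{
    size_t nr_levels = 0;

    assert(dist != NULL && levels != NULL);
    for (size_t i = 0; i < nr_nodes * nr_nodes; i++) {
        int d = dist[i];
        size_t pos = 0;

        /* Find the insertion point in the sorted list so far. */
        while (pos < nr_levels && levels[pos] < d)
            pos++;
        if (pos < nr_levels && levels[pos] == d)
            continue;               /* distance already recorded */

        /* Shift larger values up and insert the new distance. */
        for (size_t j = nr_levels; j > pos; j--)
            levels[j] = levels[j - 1];
        levels[pos] = d;
        nr_levels++;
    }
    return nr_levels;
}
```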
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
sched: Remove stale power aware scheduling remnants and dysfunctional knobs
sched/debug: Fix printing large integers on 32-bit platforms
sched/fair: Improve the ->group_imb logic
sched/nohz: Fix rq->cpu_load[] calculations
sched/numa: Don't scale the imbalance
sched/fair: Revert sched-domain iteration breakage
sched/x86: Rewrite set_cpu_sibling_map()
sched/numa: Fix the new NUMA topology bits
sched/numa: Rewrite the CONFIG_NUMA sched domain support
sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
sched/fair: Add some serialization to the sched_domain load-balance walk
sched/fair: Let minimally loaded cpu balance the group
sched: Change rq->nr_running to unsigned int
x86/numa: Check for nonsensical topologies on real hw as well
x86/numa: Hard partition cpu topology masks on node boundaries
x86/numa: Allow specifying node_distance() for numa=fake
x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
sched: Update documentation and comments
sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()
Pull perf changes from Ingo Molnar:
"Lots of changes:
- (much) improved assembly annotation support in perf report, with
jump visualization, searching, navigation, visual output
improvements and more.
- kernel support for AMD IBS PMU hardware features. Notably 'perf
record -e cycles:p' and 'perf top -e cycles:p' should work without
skid now, like PEBS does on the Intel side, because it takes
advantage of IBS transparently.
- the libtraceevent library: it is the first step towards unifying
tracing tooling and perf, and it also gives a tracing library for
external tools like powertop to rely on.
- infrastructure: various improvements and refactoring of the UI
modules and related code
- infrastructure: cleanup and simplification of the profiling
targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.)
- tons of robustness fixes all around
- various ftrace updates: speedups, cleanups, robustness
improvements.
- typing 'make' in tools/ will now give you a menu of projects to
build and a short help text to explain what each does.
- ... and lots of other changes I forgot to list.
The perf record make bzImage + perf report regression you reported
should be fixed."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits)
tracing: Remove kernel_lock annotations
tracing: Fix initial buffer_size_kb state
ring-buffer: Merge separate resize loops
perf evsel: Create events initially disabled -- again
perf tools: Split term type into value type and term type
perf hists: Fix callchain ip printf format
perf target: Add uses_mmap field
ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
ftrace: Make ftrace_modify_all_code() global for archs to use
ftrace: Return record ip addr for ftrace_location()
ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
ftrace: Speed up search by skipping pages by address
ftrace: Remove extra helper functions
ftrace: Sort all function addresses, not just per page
tracing: change CPU ring buffer state from tracing_cpumask
tracing: Check return value of tracing_dentry_percpu()
ring-buffer: Reset head page before running self test
ring-buffer: Add integrity check at end of iter read
ring-buffer: Make addition of pages in ring buffer atomic
...
Pull cgroup updates from Tejun Heo:
"cgroup file type addition / removal is updated so that file types are
added and removed instead of individual files so that dynamic file
type addition / removal can be implemented by cgroup and used by
controllers. blkio controller changes which will come through block
tree are dependent on this. Other changes include res_counter cleanup
and disallowing kthread / PF_THREAD_BOUND threads to be attached to
non-root cgroups.
There's a reported bug with the file type addition / removal handling
which can lead to oops on cgroup umount. The issue is being looked
into. It shouldn't cause problems for most setups and isn't a
security concern."
Fix up trivial conflict in Documentation/feature-removal-schedule.txt
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
res_counter: Account max_usage when calling res_counter_charge_nofail()
res_counter: Merge res_counter_charge and res_counter_charge_nofail
cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
cgroup: remove cgroup_subsys->populate()
cgroup: get rid of populate for memcg
cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
cgroup: make css->refcnt clearing on cgroup removal optional
cgroup: use negative bias on css->refcnt to block css_tryget()
cgroup: implement cgroup_rm_cftypes()
cgroup: introduce struct cfent
cgroup: relocate __d_cgrp() and __d_cft()
cgroup: remove cgroup_add_file[s]()
cgroup: convert memcg controller to the new cftype interface
memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
cgroup: convert all non-memcg controllers to the new cftype interface
cgroup: relocate cftype and cgroup_subsys definitions in controllers
cgroup: merge cft_release_agent cftype array into the base files array
cgroup: implement cgroup_add_cftypes() and friends
cgroup: build list of all cgroups under a given cgroupfs_root
cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
...
Pull workqueue changes from Tejun Heo:
"Nothing exciting. Most are updates to debug stuff and related fixes.
Two not-too-critical bugs are fixed - WARN_ON() triggering spuriously
during cpu offlining and an unlikely lockdep related oops."
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
lockdep: fix oops in processing workqueue
workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active
workqueue: Catch more locking problems with flush_work()
workqueue: change BUG_ON() to WARN_ON()
trace: Remove unused workqueue tracer
Merge tag 'staging-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging tree changes from Greg Kroah-Hartman:
"Here is the big staging tree pull request for the 3.5-rc1 merge
window.
Loads of changes here, and we just narrowly added more lines than we
removed:
622 files changed, 28356 insertions(+), 26059 deletions(-)
But, good news is that there are a number of subsystems that moved out
of the staging tree, to their respective "real" portions of the
kernel.
Code that moved out was:
- iio core code
- mei driver
- vme core and bridge drivers
There was one broken network driver that moved into staging as a step
before it is removed from the tree (pc300), and there were a few new
drivers added to the tree:
- new iio drivers
- gdm72xx wimax USB driver
- ipack subsystem and 2 drivers
All of the movements around have acks from the various subsystem
maintainers, and all of this has been in the linux-next tree for a
while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fixed up various trivial conflicts, along with a non-trivial one found
in -next and pointed out by Olof Johanssen: a clean - but incorrect -
merge of the arch/arm/boot/dts/at91sam9g20.dtsi file. Fix up manually
as per Stephen Rothwell.
* tag 'staging-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (536 commits)
Staging: bcm: Remove two unused variables from Adapter.h
Staging: bcm: Removes the volatile type definition from Adapter.h
Staging: bcm: Rename all "INT" to "int" in Adapter.h
Staging: bcm: Fix warning: __packed vs. __attribute__((packed)) in Adapter.h
Staging: bcm: Correctly format all comments in Adapter.h
Staging: bcm: Fix all whitespace issues in Adapter.h
Staging: bcm: Properly format braces in Adapter.h
Staging: ipack/bridges/tpci200: remove unneeded casts
Staging: ipack/bridges/tpci200: remove TPCI200_SHORTNAME constant
Staging: ipack: remove board_name and bus_name fields from struct ipack_device
Staging: ipack: improve the register of a bus and a device in the bus.
staging: comedi: cleanup all the comedi_driver 'detach' functions
staging: comedi: remove all 'default N' in Kconfig
staging: line6/config.h: Delete unused header
staging: gdm72xx depends on NET
staging: gdm72xx: Set up parent link in sysfs for gdm72xx devices
staging: drm/omap: initial dmabuf/prime import support
staging: drm/omap: dmabuf/prime mmap support
pstore/ram: Add ECC support
pstore/ram: Switch to persistent_ram routines
...
Merge tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg Kroah-Hartman:
"Here's the driver core, and other driver subsystems, pull request for
the 3.5-rc1 merge window.
Outside of a few minor driver core changes, we ended up with the
following different subsystem and core changes as well, due to
interdependencies on the driver core:
- hyperv driver updates
- drivers/memory being created and some drivers moved into it
- extcon driver subsystem created out of the old Android staging
switch driver code
- dynamic debug updates
- printk rework, and /dev/kmsg changes
All of this has been tested in the linux-next releases for a few weeks
with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fix up conflicts in drivers/extcon/extcon-max8997.c where git noticed
that a patch to the deleted drivers/misc/max8997-muic.c driver needs to
be applied to this one.
* tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (90 commits)
uio_pdrv_genirq: get irq through platform resource if not set otherwise
memory: tegra{20,30}-mc: Remove empty *_remove()
printk() - isolate KERN_CONT users from ordinary complete lines
sysfs: get rid of some lockdep false positives
Drivers: hv: util: Properly handle version negotiations.
Drivers: hv: Get rid of an unnecessary check in vmbus_prep_negotiate_resp()
memory: tegra{20,30}-mc: Use dev_err_ratelimited()
driver core: Add dev_*_ratelimited() family
Driver Core: don't oops with unregistered driver in driver_find_device()
printk() - restore prefix/timestamp printing for multi-newline strings
printk: add stub for prepend_timestamp()
ARM: tegra30: Make MC optional in Kconfig
ARM: tegra20: Make MC optional in Kconfig
ARM: tegra30: MC: Remove unnecessary BUG*()
ARM: tegra20: MC: Remove unnecessary BUG*()
printk: correctly align __log_buf
ARM: tegra30: Add Tegra Memory Controller(MC) driver
ARM: tegra20: Add Tegra Memory Controller(MC) driver
printk() - restore timestamp printing at console output
printk() - do not merge continuation lines of different threads
...
guts of saved_sigmask-based sigsuspend/rt_sigsuspend. Takes
kernel sigset_t *.
Open-coded instances replaced with calling it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull core irq changes from Ingo Molnar:
"A collection of small fixes."
By Thomas Gleixner
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hexagon: Remove select of not longer existing Kconfig switches
arm: Select core options instead of redefining them
genirq: Do not consider disabled wakeup irqs
genirq: Allow check_wakeup_irqs to notice level-triggered interrupts
genirq: Be more informative on irq type mismatch
genirq: Reject bogus threaded irq requests
genirq: Streamline irq_action
Pull security subsystem updates from James Morris:
"New notable features:
- The seccomp work from Will Drewry
- PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
- Longer security labels for Smack from Casey Schaufler
- Additional ptrace restriction modes for Yama by Kees Cook"
Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
apparmor: fix long path failure due to disconnected path
apparmor: fix profile lookup for unconfined
ima: fix filename hint to reflect script interpreter name
KEYS: Don't check for NULL key pointer in key_validate()
Smack: allow for significantly longer Smack labels v4
gfp flags for security_inode_alloc()?
Smack: recursive tramsmute
Yama: replace capable() with ns_capable()
TOMOYO: Accept manager programs which do not start with / .
KEYS: Add invalidation support
KEYS: Do LRU discard in full keyrings
KEYS: Permit in-place link replacement in keyring list
KEYS: Perform RCU synchronisation on keys prior to key destruction
KEYS: Announce key type (un)registration
KEYS: Reorganise keys Makefile
KEYS: Move the key config into security/keys/Kconfig
KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
Yama: remove an unused variable
samples/seccomp: fix dependencies on arch macros
Yama: add additional ptrace scopes
...
Pull smp hotplug cleanups from Thomas Gleixner:
"This series is merely a cleanup of code copied around in arch/* and
not changing any of the real cpu hotplug horrors yet. I wish I'd had
something more substantial for 3.5, but I underestimated the lurking
horror..."
Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
arch/sparc/include/asm/thread_info_32.h
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
um: Remove leftover declaration of alloc_task_struct_node()
task_allocator: Use config switches instead of magic defines
sparc: Use common threadinfo allocator
score: Use common threadinfo allocator
sh-use-common-threadinfo-allocator
mn10300: Use common threadinfo allocator
powerpc: Use common threadinfo allocator
mips: Use common threadinfo allocator
hexagon: Use common threadinfo allocator
m32r: Use common threadinfo allocator
frv: Use common threadinfo allocator
cris: Use common threadinfo allocator
x86: Use common threadinfo allocator
c6x: Use common threadinfo allocator
fork: Provide kmemcache based thread_info allocator
tile: Use common threadinfo allocator
fork: Provide weak arch_release_[task_struct|thread_info] functions
fork: Move thread info gfp flags to header
fork: Remove the weak insanity
sh: Remove cpu_idle_wait()
...
Pull RCU changes from Ingo Molnar:
"This is the v3.5 RCU tree from Paul E. McKenney:
1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature (with
more on the way for 3.6). Posted to LKML:
https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
https://lkml.org/lkml/2012/4/16/611 (commit 4),
https://lkml.org/lkml/2012/4/30/390 (commit 6), and
https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
the other commits for the convenience of the tester).
2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
that have no RCU callbacks. Posted to LKML:
https://lkml.org/lkml/2012/4/23/322.
3) A couple of commits that improve the efficiency of the interaction
between preemptible RCU and the scheduler, these two being all that
survived an abortive attempt to allow preemptible RCU's
__rcu_read_lock() to be inlined. The full set was posted to LKML at
https://lkml.org/lkml/2012/4/14/143, and the first and third patches
of that set remain.
4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
call_srcu() and srcu_barrier(). A major feature of this new
implementation is that synchronize_srcu() no longer disturbs the
execution of other CPUs. This work is based on earlier
implementations by Peter Zijlstra and Paul E. McKenney. Posted to
LKML: https://lkml.org/lkml/2012/2/22/82.
5) A number of miscellaneous bug fixes and improvements which were
posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
subsequent updates posted to LKML."
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
rcu: Make rcu_barrier() less disruptive
rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
rcu: Make RCU_FAST_NO_HZ handle timer migration
rcu: Update RCU maintainership
rcu: Make exit_rcu() more precise and consolidate
rcu: Move PREEMPT_RCU preemption to switch_to() invocation
rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPU
rcu: Add rcutorture test for call_srcu()
rcu: Implement per-domain single-threaded call_srcu() state machine
rcu: Use single value to handle expedited SRCU grace periods
rcu: Improve srcu_readers_active_idx()'s cache locality
rcu: Remove unused srcu_barrier()
rcu: Implement a variant of Peter's SRCU algorithm
rcu: Improve SRCU's wait_idx() comments
rcu: Flip ->completed only once per SRCU grace period
rcu: Increment upper bit only for srcu_read_lock()
rcu: Remove fast check path from __synchronize_srcu()
rcu: Direct algorithmic SRCU implementation
rcu: Introduce rcutorture testing for rcu_barrier()
timer: Fix mod_timer_pinned() header comment
...
Pull core locking updates from Ingo Molnar:
"This update:
- extends and simplifies x86 NMI callback handling code to enhance
and fix the HP hw-watchdog driver
- simplifies the x86 NMI callback handling code to fix a kmemcheck
bug.
- enhances the hung-task debugger"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix the type of the nmiaction.flags field
x86/nmi: Fix page faults by nmiaction if kmemcheck is enabled
x86/nmi: Add new NMI queues to deal with IO_CHK and SERR
watchdog, hpwdt: Remove priority option for NMI callback
hung task debugging: Inject NMI when hung and going to panic
When repeating a UTC time value during a leap second (when the UTC
time should be 23:59:60), the TAI timescale should not stop. The kernel
NTP code increments the TAI offset one second too late. This patch fixes
the issue by incrementing the offset during the leap second itself.
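The intended behaviour can be sketched with a small userspace model. This is not the kernel NTP code; struct clock_model, tick() and tai() are hypothetical names used only for illustration. The UTC counter repeats one second at the leap, and the TAI offset is bumped in that same step so that TAI keeps advancing:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not the kernel's NTP state machine. */
struct clock_model {
	long utc;		/* UTC seconds counter */
	int tai_offset;		/* TAI - UTC, in seconds */
	int leap_pending;	/* 1 if a leap second is to be inserted */
};

/*
 * Advance by one real second. "midnight" is the UTC count at which
 * the day would roll over; the leap second is inserted just before it.
 */
static void tick(struct clock_model *c, long midnight)
{
	assert(c != NULL);
	c->utc++;
	if (c->leap_pending && c->utc == midnight) {
		c->utc--;		/* repeat the last UTC second (23:59:60) */
		c->tai_offset++;	/* bump the offset during the leap second */
		c->leap_pending = 0;
	}
}

/* TAI derived from the UTC counter and the current offset. */
static long tai(const struct clock_model *c)
{
	return c->utc + c->tai_offset;
}
```

In this model, deferring the tai_offset++ to the following tick would make tai() return the same value twice, which is the stall described above.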
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Sigh, I failed to check which architecture Kconfig files actually
include the core Kconfig file. A few did not, so we broke them.
Instead of adding the includes to those, we are better off moving the
include to init/Kconfig, as we already did with irqs and others.
This does not change anything for the architectures using the old
style periodic timer mode. It just solves the build wreckage there.
For those architectures which use the clock events infrastructure it
moves the include of the core Kconfig file to "General setup" which is
a way more logical place than having it at random locations specified
by the architecture specific Kconfigs.
Reported-by: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Gleixner <anna-maria@glx-um.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As its only user (UML) no longer needs it, we can get
rid of it.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
We really don't want all the arch code defining stuff
over and over.
[ anna-maria: Added missing GENERIC_CMOS_UPDATE switch ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@glx-um.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Link: http://lkml.kernel.org/r/1337529587.3208.2.camel@dionysos
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Fixes for perf/core:
- Rename some perf_target methods to avoid double negation, from Namhyung Kim.
- Revert change to use per task events with inheritance, from Namhyung Kim.
- Events should start disabled until the children start running, from David Ahern.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On 32bit builds gcc says:
kernel/user.c:30:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]
kernel/user.c:38:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]
Silence gcc by changing the constant 4294967295 to 4294967295U.
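For background: in C90 an unsuffixed decimal constant that does not fit in long becomes unsigned long, while C99 makes it long long, which is why gcc warns on 32-bit builds. The sketch below shows the effect of the U suffix; FULL_ID_RANGE is an illustrative name, not the identifier used in kernel/user.c, and a 32-bit unsigned int is assumed:

```c
#include <assert.h>

/* Illustrative name; the U suffix makes the type explicit. */
#define FULL_ID_RANGE 4294967295U

/*
 * Returns 1 if the constant has an unsigned type: with unsigned
 * arithmetic, 0 - 1 wraps around to a large positive value.
 */
static int full_range_is_unsigned(void)
{
	return (FULL_ID_RANGE * 0 - 1) > 0;
}

/* Returns 1 if the constant is an unsigned int holding 2^32 - 1. */
static int full_range_is_u32_max(void)
{
	return sizeof(FULL_ID_RANGE) == sizeof(unsigned int) &&
	       FULL_ID_RANGE == 0xffffffffU;
}
```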
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Convert to pr_fmt before things start to get out of hand and some
janitors start getting overly excited.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Presently irq_domain_simple_map() isn't labelled as static, but there's
no definition for it in the public irqdomain header either. At present
all in-tree ->map users have meaningful work to do, and all others are
using irq_domain_simple_ops directly. Make it static for now, as it can
always be exported and added to the public API later.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Modules making use of irq domains at the very least need access to the
add/remove/lookup routines, though there's nothing preventing them from
using the remainder of the public API, either.
The current set of exports seems primarily geared at DT-enabled platforms
using DT-backed IRQ domains, where many of the API accesses are hidden
away in OF code. The non-DT cases need to do most of this on their own.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Now that IRQ domains are being used by modules it's necessary to support
removing them, too. This adds a new irq_domain_remove() routine for doing
the bulk of the heavy lifting. It's left as an exercise to the caller to
ensure all mappings have been appropriately disposed of before attempting
to remove the domain.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Make sure that the state of buffer_size_kb is initialized correctly and
returns the actual size of the ring buffer.
Link: http://lkml.kernel.org/r/1336066834-1673-1-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are 2 separate loops to resize cpu buffers that are online and
offline. Merge them to make the code look better.
Also change the name from update_completion to update_done to allow
shorter lines.
Link: http://lkml.kernel.org/r/1337372991-14783-1-git-send-email-vnagarnaik@google.com
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Sometimes the resume= parameter is given in integer style (e.g.
major:minor), and then name_to_dev_t() cannot detect the partition
properly, especially for async devices such as usb or mmc.
This patch calls get_gendisk() if resumewait is true and resume_file
is in integer format to work around this problem.
Signed-off-by: Minho Ban <mhban@samsung.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Merge reason: We are going to queue up a dependent patch:
"perf tools: Move parse event automated tests to separated object"
That depends on:
commit e7c72d8
perf tools: Add 'G' and 'H' modifiers to event parsing
Conflicts:
tools/perf/builtin-stat.c
Conflicted with the recent 'perf_target' patches when checking the
result of perf_evsel open routines to see if a retry is needed to cope
with older kernels where the exclude guest/host perf_event_attr bits
were not used.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch moves kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop(),
to serialize the crash-logging process via smp_send_stop() and to
thus retrieve a more stable crash image of all CPUs stopped.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: dle-develop@lists.sourceforge.net <dle-develop@lists.sourceforge.net>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: a.p.zijlstra@chello.nl <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/5C4C569E8A4B9B42A84A977CF070A35B2E4D7A5CE2@USINDEVS01.corp.hds.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Usually sleep-in-atomic bugs are followed by dozens of other warnings.
This patch should help to figure out the original source of the problem.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120510122004.4873.12726.stgit@zurg
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit "userns: Convert setting and getting uid and gid system calls to use
kuid and kgid" modified the accessors in wait_task_continued() and
wait_task_stopped() to use __task_cred() instead of task_uid().
__task_cred() assumes that we're inside an RCU read lock, which is untrue
for these two functions.
Modify them to use task_uid() instead.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Pull perf, x86 and scheduler updates from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing: Do not enable function event with enable
perf stat: handle ENXIO error for perf_event_open
perf: Turn off compiler warnings for flex and bison generated files
perf stat: Fix case where guest/host monitoring is not supported by kernel
perf build-id: Fix filename size calculation
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kvm: KVM paravirt kernels don't check for CPUID being unavailable
x86: Fix section annotation of acpi_map_cpu2node()
x86/microcode: Ensure that module is only loaded on supported Intel CPUs
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption
It's been broken forever (i.e. it's not scheduling in a power
aware fashion), as reported by Suresh and others sending
patches, and nobody cares enough to fix it properly ...
so remove it to make space free for something better.
There are various problems with the code as it stands today, first
and foremost the user interface which is bound to topology
levels and has multiple values per level. This results in a
state explosion which the administrator or distro needs to
master and almost nobody does.
Furthermore, large configuration state spaces aren't good: it
means the thing doesn't just work right, because it's either
under so many impossible-to-meet constraints, or even if
there's an achievable state, workloads have to be aware of
it precisely and can never meet it for dynamic workloads.
So pushing this kind of decision to user-space was a bad idea
even with a single knob - it's exponentially worse with knobs
on every node of the topology.
There is a proposal to replace the user interface with a single
3 state knob:
sched_balance_policy := { performance, power, auto }
where 'auto' would be the preferred default which looks at things
like Battery/AC mode and possible cpufreq state or whatever the hw
exposes to show us power use expectations - but there's been no
progress on it in the past many months.
Aside from that, the actual implementation of the various knobs
is known to be broken. There have been sporadic attempts at
fixing things but these always stop short of reaching a mergeable
state.
Therefore this wholesale removal with the hopes of spurring
people who care to come forward once again and work on a
coherent replacement.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The function tracer will enable the -pg option with gcc, which requires
frame pointers. When FRAME_POINTER is defined in the kernel config
it adds the gcc option -fno-omit-frame-pointer which causes some problems
on some architectures. For those architectures, the FRAME_POINTER select
was not set.
When FUNCTION_TRACER was selected on these architectures that can not have
-fno-omit-frame-pointer, the -pg option is still set. But when
FRAME_POINTER is not selected, the kernel config would add the gcc option
-fomit-frame-pointer. Adding this option is incompatible with -pg
even on archs that do not need frame pointers with -pg.
The answer to this was to just not add either -fno-omit-frame-pointer
or -fomit-frame-pointer on these archs that want function tracing
but do not set FRAME_POINTER.
As it turns out, for archs that require frame pointers for function
tracing, the same can be used. If gcc requires frame pointers with
-pg, it will simply add it. The best thing to do is not select FRAME_POINTER
when function tracing is selected, and let gcc add it if needed.
Only add the -fno-omit-frame-pointer when something else selects
FRAME_POINTER, but do not add -fomit-frame-pointer if function tracing
is selected.
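The resulting rule can be written down as a tiny decision function. This is only a model of the logic described above, not the kernel Makefile; frame_pointer_flag() is a hypothetical helper, and an empty string stands for "pass neither option and let gcc decide":

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return the frame-pointer option to pass to gcc, or "" to pass
 * neither option. Inputs mirror the FRAME_POINTER and
 * FUNCTION_TRACER config symbols.
 */
static const char *frame_pointer_flag(bool frame_pointer, bool function_tracer)
{
	if (frame_pointer)
		return "-fno-omit-frame-pointer";
	if (function_tracer)
		return "";	/* -pg is in use: gcc adds frame pointers if needed */
	return "-fomit-frame-pointer";
}
```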
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To remove duplicate code, have the ftrace arch_ftrace_update_code()
use the generic ftrace_modify_all_code(). This requires that the
default ftrace_replace_code() becomes a weak function so that an
arch may override it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Rename __ftrace_modify_code() to ftrace_modify_all_code() and make
it global for all archs to use. This will remove the duplication
of code, as archs that can modify code without stop_machine()
can use it directly outside of the stop_machine() call.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_location() is passed an addr, and returns 1 if the addr is
on a ftrace nop (or caller to ftrace_caller), and 0 otherwise.
To let kprobes know if it should move a breakpoint or not, it
must return the actual addr that is the start of the ftrace nop.
This way a kprobe placed on the location of a ftrace nop, can
instead be placed on the instruction after the nop. Even if the
probe addr is on the second or later byte of the nop, it can
simply be moved forward.
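A minimal sketch of the idea, assuming a 5-byte nop and a plain array of nop start addresses; nop_start() and probe_addr() are hypothetical names, not the kernel functions:

```c
#include <assert.h>
#include <stddef.h>

#define NOP_SIZE 5UL	/* assumed size of one ftrace nop, in bytes */

/*
 * Return the start address of the nop that covers addr, or 0 if addr
 * is not inside any nop. recs holds the nop start addresses.
 */
static unsigned long nop_start(const unsigned long *recs, size_t n,
			       unsigned long addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (addr >= recs[i] && addr < recs[i] + NOP_SIZE)
			return recs[i];
	return 0;
}

/* Where a probe requested at addr would be placed: just past the nop. */
static unsigned long probe_addr(const unsigned long *recs, size_t n,
				unsigned long addr)
{
	unsigned long start = nop_start(recs, n, addr);

	return start ? start + NOP_SIZE : addr;
}
```

Because the lookup returns the start of the nop rather than a plain 1, the caller can compute the next instruction even when the requested address falls in the middle of the nop.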
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both ftrace_location() and ftrace_text_reserved() do basically the same thing.
They search to see if an address is in the ftrace table (contains an address
that may change from nop to call ftrace_caller). The difference is
that ftrace_location() searches a single address, but ftrace_text_reserved()
searches a range.
This also makes the ftrace_text_reserved() faster as it now uses a bsearch()
instead of linearly searching all the addresses within a page.
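One way to share the lookup is to treat a single address as a range of length one and let bsearch() compare the range against each record. The sketch below uses the C library bsearch() over a flat sorted array; the names and the 5-byte nop size are assumptions for illustration:

```c
#include <assert.h>
#include <stdlib.h>

#define NOP_SIZE 5UL	/* assumed size of one ftrace nop, in bytes */

struct range {
	unsigned long start, end;	/* inclusive search range */
};

/* bsearch() comparator: key is a struct range, elem is a nop start address. */
static int cmp_range(const void *key, const void *elem)
{
	const struct range *r = key;
	unsigned long ip = *(const unsigned long *)elem;

	if (r->end < ip)
		return -1;	/* range lies below this nop */
	if (r->start >= ip + NOP_SIZE)
		return 1;	/* range lies above this nop */
	return 0;		/* range overlaps this nop */
}

/* Return the start of a nop overlapping [start, end], or 0 if none. */
static unsigned long location_range(const unsigned long *recs, size_t n,
				    unsigned long start, unsigned long end)
{
	struct range key = { start, end };
	const unsigned long *hit;

	hit = bsearch(&key, recs, n, sizeof(*recs), cmp_range);
	return hit ? *hit : 0;
}

/* The single-address lookup is the range lookup with start == end. */
static unsigned long location(const unsigned long *recs, size_t n,
			      unsigned long addr)
{
	return location_range(recs, n, addr, addr);
}
```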
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As all records in a page of the ftrace table are sorted, we can
speed up the search algorithm by checking if the address to look for
falls in between the first and last record ip on the page.
This speeds up both the ftrace_location() and ftrace_text_reserved()
algorithms, as it can skip full pages when the search address is
not in them.
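The page-skipping check can be sketched as follows. struct rec_page is a hypothetical stand-in for one page of the ftrace table, and the pages_searched counter exists only to make the skipping visible:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one page of the table: a sorted array. */
struct rec_page {
	const unsigned long *ips;	/* sorted record addresses */
	size_t count;			/* number of records on the page */
};

/* Return 1 if addr is recorded in one of the pages, 0 otherwise. */
static int table_contains(const struct rec_page *pages, size_t npages,
			  unsigned long addr, size_t *pages_searched)
{
	size_t p, lo, hi;

	*pages_searched = 0;
	for (p = 0; p < npages; p++) {
		const struct rec_page *pg = &pages[p];

		/* Skip the whole page if addr is outside [first, last]. */
		if (pg->count == 0 || addr < pg->ips[0] ||
		    addr > pg->ips[pg->count - 1])
			continue;

		(*pages_searched)++;
		/* Binary search within the page. */
		lo = 0;
		hi = pg->count;
		while (lo < hi) {
			size_t mid = lo + (hi - lo) / 2;

			if (pg->ips[mid] == addr)
				return 1;
			if (pg->ips[mid] < addr)
				lo = mid + 1;
			else
				hi = mid;
		}
	}
	return 0;
}
```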
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_record_ip() and ftrace_alloc_dyn_node() functions date from
the time of the ftrace daemon. Although they are still used, they
make things a bit more complex than necessary.
Move the code into the one function that uses it, and remove the
helper functions.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of just sorting the ip's of the functions per ftrace page,
sort the entire list before adding them to the ftrace pages.
This will allow the bsearch algorithm to be sped up as it can
also sort by pages, not just records within a page.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to Documentation/trace/ftrace.txt:
tracing_cpumask:
This is a mask that lets the user only trace
on specified CPUS. The format is a hex string
representing the CPUS.
The tracing_cpumask currently doesn't affect the tracing state of
per-CPU ring buffers.
This patch enables/disables CPU recording as its corresponding bit in
tracing_cpumask is set/unset.
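As a rough illustration, the update can be modelled as a walk over the CPUs that toggles recording only where a bit changed. The names below (apply_cpumask(), the recording array, NR_CPUS = 8) are assumptions for the sketch, not the tracing code itself:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 8	/* assumed CPU count for the sketch */

/*
 * Move from old_mask to new_mask. recording[cpu] is 1 while that
 * CPU's buffer records; only CPUs whose bit changed are touched.
 */
static void apply_cpumask(unsigned int old_mask, unsigned int new_mask,
			  int *recording)
{
	int cpu;

	assert(recording != NULL);
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		unsigned int bit = 1U << cpu;

		if ((old_mask & bit) && !(new_mask & bit))
			recording[cpu] = 0;	/* bit cleared: stop recording */
		else if (!(old_mask & bit) && (new_mask & bit))
			recording[cpu] = 1;	/* bit set: start recording */
	}
}
```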
Link: http://lkml.kernel.org/r/1336096792-25373-3-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If tracing_dentry_percpu() fails, tracing_init_debugfs_percpu()
will try to create each cpu directory in debugfs' root directory,
as d_percpu is NULL.
Link: http://lkml.kernel.org/r/1335143517-2285-1-git-send-email-namhyung.kim@lge.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the ring buffer does its consistency test on itself, it
removes the head page, runs the tests, and then adds it back
to what the "head_page" pointer was. But because the head_page
pointer may lag behind the real head page (held by the link
list pointer), the reset may be incorrect.
Instead, if the head_page exists (it does not on first allocation)
reset it back to the real head page before running the consistency
tests. Then it will be put back to its original location after
the tests are complete.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There used to be ring buffer integrity checks after updating the
size of the ring buffer. But now that the ring buffer can modify
the size while the system is running, the integrity checks were
removed, as they require the ring buffer to be disabled to perform
the check.
Move the integrity check to the reading of the ring buffer via the
iterator reads (the "trace" file). As reading via an iterator requires
disabling the ring buffer, it is a perfect place to have it.
If the ring buffer happens to be disabled when updating the size,
we still perform the integrity check.
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Historically, prepare_to_copy() has been mostly a no-op, duplicated across
the majority of the architectures, with the rest following the x86 model of
flushing the extended register state (such as the fpu) there.
Remove it and use arch_dup_task_struct() instead.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-1-git-send-email-suresh.b.siddha@intel.com
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch adds the capability to add new pages to a ring buffer
atomically while write operations are going on. This makes it possible
to expand the ring buffer size without reinitializing the ring buffer.
The new pages are attached between the head page and its previous page.
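The placement of the new pages can be pictured as a splice into a circular doubly linked list. The sketch below uses plain pointer stores and is therefore not safe against concurrent writers, unlike the patch it illustrates; struct page_node and insert_before_head() are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

struct page_node {
	struct page_node *next, *prev;
};

/*
 * Splice the chain first..last (already linked to each other) into
 * the ring between head->prev and head.
 */
static void insert_before_head(struct page_node *head,
			       struct page_node *first,
			       struct page_node *last)
{
	struct page_node *prev;

	assert(head != NULL && first != NULL && last != NULL);
	prev = head->prev;
	first->prev = prev;	/* new chain starts after the old previous page */
	last->next = head;	/* and ends right before the head page */
	prev->next = first;
	head->prev = last;
}
```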
Link: http://lkml.kernel.org/r/1336096792-25373-2-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds the capability to remove pages from a ring buffer
without destroying any existing data in it.
This is done by removing the pages after the tail page. This makes sure
that first all the empty pages in the ring buffer are removed. If the
head page is one in the list of pages to be removed, then the page after
the removed ones is made the head page. This removes the oldest data
from the ring buffer and keeps the latest data around to be read.
To do this in a non-racy manner, tracing is stopped for a very short
time while the pages to be removed are identified and unlinked from the
ring buffer. The pages are freed after the tracing is restarted to
minimize the time needed to stop tracing.
The context in which the pages from the per-cpu ring buffer are removed
runs on the respective CPU. This minimizes the events not traced to only
NMI trace contexts.
Link: http://lkml.kernel.org/r/1336096792-25373-1-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On gcc 4.5 the function tracing_mark_write() would give a warning
of page2 being uninitialized. This is due to a bug in gcc because
the logic prevents page2 from being used uninitialized, and
gcc 4.6+ does not complain (correctly).
Instead of wrapping page2 in an "uninitialized" annotation, which could show
a bug later on, I combined page1 and page2 into an array map_pages[].
This binds the two and the two are modified according to nr_pages
(what gcc 4.5 seems to ignore). This no longer gives a warning with
gcc 4.5 nor with gcc 4.6.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
map_cred_ns is a light wrapper around from_kuid with the order of the arguments
reversed. Replace map_cred_ns with from_kuid and remove map_cred_ns.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Export handle_edge_irq() and irq_to_desc() to modules to allow them to
do things such as
__irq_set_handler_locked(...., handle_edge_irq);
This fixes
ERROR: "handle_edge_irq" [drivers/gpio/gpio-pch.ko] undefined!
ERROR: "irq_to_desc" [drivers/gpio/gpio-pch.ko] undefined!
when gpio-pch is being built as a module.
This was introduced by commit df9541a60a ("gpio: pch9: Use proper flow
type handlers") that added
__irq_set_handler_locked(d->irq, handle_edge_irq);
but handle_edge_irq() was not exported for modules (and inlined
__irq_set_handler_locked() requires irq_to_desc() exported as well)
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Under memory load, on x86_64, with lockdep enabled, the workqueue's
process_one_work() has been seen to oops in __lock_acquire(), barfing
on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].
Because it's permissible to free a work_struct from its callout function,
the map used is an onstack copy of the map given in the work_struct: and
that copy is made without any locking.
Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
"rep movsq" for that structure copy: which might race with a workqueue
user's wait_on_work() doing lock_map_acquire() on the source of the
copy, putting a pointer into the class_cache[], but only in time for
the top half of that pointer to be copied to the destination map.
Boom when process_one_work() subsequently does lock_map_acquire()
on its onstack copy of the lockdep_map.
Fix this, and a similar instance in call_timer_fn(), with a
lockdep_copy_map() function which additionally NULLs the class_cache[].
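The shape of such a helper can be sketched in userspace. struct map below is a hypothetical stand-in for struct lockdep_map, and the number of cache slots is an assumption:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CACHED 2	/* assumed number of cache slots */

/* Hypothetical stand-in for struct lockdep_map. */
struct map {
	const char *name;
	void *class_cache[NR_CACHED];	/* may be written concurrently */
};

/*
 * Copy src into dst, then clear the cache in the copy so that a
 * half-copied pointer can never be used through dst.
 */
static void copy_map(struct map *dst, const struct map *src)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	*dst = *src;
	for (i = 0; i < NR_CACHED; i++)
		dst->class_cache[i] = NULL;
}
```

Clearing the slots costs little in this model, since class_cache[] is only a cache and an empty slot simply means the class has to be looked up again.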
Note: this oops was actually seen on 3.4-next, where flush_work() newly
does the racing lock_map_acquire(); but Tejun points out that 3.4 and
earlier are already vulnerable to the same through wait_on_work().
* Patch originally from Peter. Hugh modified it a bit and wrote the
description.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Hugh Dickins <hughd@google.com>
LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
Signed-off-by: Tejun Heo <tj@kernel.org>
* linus/master: (805 commits)
tty: Fix LED error return
openvswitch: checking wrong variable in queue_userspace_packet()
bonding: Fix LACPDU rx_dropped commit.
Linux 3.4-rc7
ARM: EXYNOS: fix ctrlbit for exynos5_clk_pdma1
ARM: EXYNOS: use s5p-timer for UniversalC210 board
ARM / mach-shmobile: Invalidate caches when booting secondary cores
ARM / mach-shmobile: sh73a0 SMP TWD boot regression fix
ARM / mach-shmobile: r8a7779 SMP TWD boot regression fix
ARM: mach-shmobile: convert ag5evm to use the generic MMC GPIO hotplug helper
ARM: mach-shmobile: convert mackerel to use the generic MMC GPIO hotplug helper
MAINTAINERS: Add myself as the cpufreq maintainer
dm mpath: check if scsi_dh module already loaded before trying to load
dm thin: correct module description
dm thin: fix unprotected use of prepared_discards list
dm thin: reinstate missing mempool_free in cell_release_singleton
gpio/exynos: Fix compiler warnings when non-exynos machines are selected
gpio: pch9: Use proper flow type handlers
powerpc/irq: Fix another case of lazy IRQ state getting out of sync
ks8851: Update link status during link change interrupt
...
Conflicts:
drivers/media/common/tuners/xc5000.c
drivers/media/common/tuners/xc5000.h
drivers/usb/gadget/uvc_queue.c
worker_enter_idle() has WARN_ON_ONCE() which triggers if nr_running
isn't zero when every worker is idle. This can trigger spuriously
while a cpu is going down due to the way trustee sets %WORKER_ROGUE
and zaps nr_running.
It first sets %WORKER_ROGUE on all workers without updating
nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and
then zaps nr_running. If the last running worker enters idle
in between, it would see stale nr_running which hasn't been zapped yet
and trigger the WARN_ON_ONCE().
Fix it by performing the sanity check iff the trustee is idle.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Arrange the continuation printk() buffering to be fully separated from the
ordinary full line users.
Limit the exposure to races and wrong printk() line merges to users of
continuation only. Ordinary full line users racing against continuation
users will no longer affect each other.
Multiple continuation users from different threads, racing against each
other will not wrongly be merged into a single line, but printed as
separate lines.
Test output of a kernel module which starts two separate threads which
race against each other, one of them printing a single full terminated
line:
printk("(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)\n");
The other one printing the line, every character separate in a
continuation loop:
printk("(C");
for (i = 0; i < 58; i++)
printk(KERN_CONT "C");
printk(KERN_CONT "C)\n");
Behavior of single and non-thread-aware printk() buffer:
# modprobe printk-race
printk test init
(CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
New behavior with separate and thread-aware continuation buffer:
# modprobe printk-race
printk test init
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
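The separation described above can be sketched in userspace as follows.
This is a minimal model under stated assumptions, not the kernel code:
the integer owner ids, print_line(), print_cont() and the fixed-size
line store are all hypothetical names made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define LINE_MAX_LEN 128
#define MAX_LINES 16

/* Finished lines, in the order they reach the (hypothetical) log. */
static char lines[MAX_LINES][LINE_MAX_LEN];
static int nr_lines;

/* One continuation buffer, tagged with the thread that owns it. */
static struct {
	int owner;		/* hypothetical thread id */
	size_t len;
	char buf[LINE_MAX_LEN];
} cont;

static void emit_line(const char *text, size_t len)
{
	if (nr_lines < MAX_LINES && len < LINE_MAX_LEN) {
		memcpy(lines[nr_lines], text, len);
		lines[nr_lines][len] = '\0';
		nr_lines++;
	}
}

static void cont_flush(void)
{
	if (cont.len)
		emit_line(cont.buf, cont.len);
	cont.len = 0;
}

/* Full-line users bypass the continuation buffer entirely. */
static void print_line(const char *text)
{
	emit_line(text, strlen(text));
}

/* Continuation users append; a different owner forces a flush first. */
static void print_cont(int owner, const char *text, int last)
{
	size_t n = strlen(text);

	if (n >= LINE_MAX_LEN)
		n = LINE_MAX_LEN - 1;
	if (cont.len && cont.owner != owner)
		cont_flush();
	if (cont.len + n >= LINE_MAX_LEN)
		cont_flush();
	cont.owner = owner;
	memcpy(cont.buf + cont.len, text, n);
	cont.len += n;
	if (last)
		cont_flush();
}
```

In this model a full line printed in the middle of a continuation
sequence is stored on its own, and two continuation users with different
owner ids end up on separate lines instead of being merged.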
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some numbers like nr_running and nr_uninterruptible are fundamentally
unsigned since it's impossible to have a negative number of tasks, yet
we still print them as signed to easily recognise the underflow
condition.
rq->nr_uninterruptible has 'special' accounting and can in fact very
easily become negative on a per-cpu basis.
It was noted that, since the P() macro assumes things are long long and
the promotion of unsigned 'int/long' to long long on 32bit doesn't
sign extend, we print silly large numbers instead of the easier to read
signed numbers.
Therefore extend the P() macro to not require the sign extension.
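The widening problem can be reproduced in plain C. The sketch below uses
hypothetical helper names and a fixed 32-bit type to stand in for
'unsigned int/long' on 32bit; it illustrates the effect and is not the
P() macro itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widening an unsigned 32-bit value zero-extends, so an underflowed
 * counter shows up as a huge positive number.
 */
static long long widen_unsigned(uint32_t v)
{
	return (long long)v;
}

/*
 * Converting to the signed type of the same width first makes the
 * following widening sign-extend, so the underflow reads as negative.
 * (The unsigned-to-signed conversion is implementation-defined in ISO C;
 * the usual two's complement compilers simply reinterpret the bits.)
 */
static long long widen_as_signed(uint32_t v)
{
	return (long long)(int32_t)v;
}
```

For a counter that underflowed by one, the first helper yields
4294967295 while the second yields -1.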
Reported-by: Diwakar Tundlam <dtundlam@nvidia.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-gk5tm8t2n4ix2vkpns42uqqp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Group imbalance is meant to deal with situations where affinity masks
and sched domains don't align well, such as 3 cpus from one group and
6 from another. In this case the domain based balancer will want to
put an equal amount of tasks on each side even though they don't have
equal cpus.
Currently group_imb is set whenever two cpus of a group have a weight
difference of at least one avg task and the heaviest cpu has at least
two tasks. A group with imbalance set will always be picked as busiest
and a balance pass will be forced.
The problem is that even if there are no affinity masks this stuff can
trigger and cause weird balancing decisions, eg. the observed
behaviour was that of 6 cpus, 5 had 2 and 1 had 3 tasks, due to the
difference of 1 avg load (they all had the same weight) and nr_running
being >1 the group_imbalance logic triggered and did the weird thing
of pulling more load instead of trying to move the 1 excess task to
the other domain of 6 cpus that had 5 cpu with 2 tasks and 1 cpu with
1 task.
Curb the group_imbalance stuff by making the nr_running condition
weaker by also tracking the min_nr_running and using the difference in
nr_running over the set instead of the absolute max nr_running.
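A minimal sketch of the two conditions, applied to the observed case.
The struct, the helper names and the "> 1" threshold on the nr_running
spread are assumptions made for illustration; only the general shape
(load difference of one avg task plus an nr_running check) comes from
the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-group statistics gathered over the cpus of a group. */
struct group_stats {
	unsigned long max_cpu_load;
	unsigned long min_cpu_load;
	unsigned int max_nr_running;
	unsigned int min_nr_running;
	unsigned long avg_load_per_task;
};

/* Old rule: load spread of one avg task and heaviest cpu has >= 2 tasks. */
static bool group_imb_old(const struct group_stats *s)
{
	return s->max_cpu_load - s->min_cpu_load >= s->avg_load_per_task &&
	       s->max_nr_running > 1;
}

/* New rule: same load spread, but look at the nr_running spread instead. */
static bool group_imb_new(const struct group_stats *s)
{
	return s->max_cpu_load - s->min_cpu_load >= s->avg_load_per_task &&
	       s->max_nr_running - s->min_nr_running > 1;
}
```

With 5 cpus running 2 tasks and 1 cpu running 3 tasks of equal weight,
the old rule flags the group as imbalanced while the new one does not.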
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While investigating why the load-balancer behaved oddly I found that
the rq->cpu_load[] tables were completely screwy. A bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.
The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc.). This means that
esp. the higher indices were significantly lower than they ought to
be.
The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.
This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's far too easy to get ridiculously large imbalance pct when you
scale it like that. Use a fixed 125% for now.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-zsriaft1dv7hhboyrpvqjy6s@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Patches c22402a2f ("sched/fair: Let minimally loaded cpu balance the
group") and 0ce90475 ("sched/fair: Add some serialization to the
sched_domain load-balance walk") are horribly broken so revert them.
The problem is that while it sounds good to have the minimally loaded
cpu do the pulling of more load, the way we walk the domains there is
absolutely no guarantee this cpu will actually get to the domain. In
fact it's very likely it won't. Therefore the higher up the tree we get,
the less likely it is we'll balance at all.
Having the first cpu of the mask always walk up, while sucky in that it
accumulates load on the first cpu and needs extra passes to spread it
out, at least guarantees a cpu gets up that far and load-balancing
happens at all.
Since it's now always the first, and idle cpus should always be able to
balance so they get a task as fast as possible, we can also do away
with the added serialization.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's no need to convert a node number to a node number by
pretending it's a cpu number.
Reported-by: Yinghai Lu <yinghai@kernel.org>
Reported-and-Tested-by: Greg Pearson <greg.pearson@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-0sqhrht34phowgclj12dgk8h@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull the v3.5 RCU tree from Paul E. McKenney:
1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature
(with more on the way for 3.6). Posted to LKML:
https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
https://lkml.org/lkml/2012/4/16/611 (commit 4),
https://lkml.org/lkml/2012/4/30/390 (commit 6), and
https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
the other commits for the convenience of the tester).
2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
that have no RCU callbacks. Posted to LKML:
https://lkml.org/lkml/2012/4/23/322.
3) A couple of commits that improve the efficiency of the interaction
between preemptible RCU and the scheduler, these two being all
that survived an abortive attempt to allow preemptible RCU's
__rcu_read_lock() to be inlined. The full set was posted to
LKML at https://lkml.org/lkml/2012/4/14/143, and the first and
third patches of that set remain.
4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
call_srcu() and srcu_barrier(). A major feature of this new
implementation is that synchronize_srcu() no longer disturbs
the execution of other CPUs. This work is based on earlier
implementations by Peter Zijlstra and Paul E. McKenney. Posted to
LKML: https://lkml.org/lkml/2012/2/22/82.
5) A number of miscellaneous bug fixes and improvements which were
posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
subsequent updates posted to LKML.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a stub for prepend_timestamp() when CONFIG_PRINTK is not
enabled. Fixes this build error:
kernel/printk.c:1770:3: error: implicit declaration of function 'prepend_timestamp'
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make it possible to configure out the user space wakeup sources
garbage collector for debugging and default Android builds.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Arve Hjønnevåg <arve@android.com>
Make it possible to configure out the check against the limit of
user space wakeup sources for debugging and default Android builds.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Arve Hjønnevåg <arve@android.com>
barrier: Reduce the amount of disturbance by rcu_barrier() to the rest of
the system. This branch also includes improvements to
RCU_FAST_NO_HZ, which are included here due to conflicts.
fixes: Miscellaneous fixes.
inline: Remaining changes from an abortive attempt to inline
preemptible RCU's __rcu_read_lock(). These are (1) making
exit_rcu() avoid unnecessary work and (2) avoiding having
preemptible RCU record a blocked thread when the scheduler
declines to do a context switch.
srcu: Lai Jiangshan's algorithmic implementation of SRCU, including
call_srcu().
__log_buf must be aligned, because a 64-bit value is written directly
to it as part of struct log. Alignment of the log entries is typically
handled by log_store(), but this only triggers for subsequent entries,
not the very first (or wrapped) entries.
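The alignment requirement can be illustrated in C11. This is a sketch
with a hypothetical header type and buffer name, showing only how a
plain char buffer gets the alignment of the struct stored at its start.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical record header; the 64-bit member dictates the alignment. */
struct log_hdr {
	uint64_t ts_nsec;
	uint16_t len;
};

#define BUF_LEN 4096

/*
 * A plain char array only guarantees 1-byte alignment. alignas() makes
 * the very first record, which sits at offset 0, properly aligned too.
 */
static alignas(struct log_hdr) char log_buf[BUF_LEN];

/* Returns nonzero if a header may be written directly at the buffer start. */
static int log_buf_is_aligned(void)
{
	return (uintptr_t)log_buf % alignof(struct log_hdr) == 0;
}
```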
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fork() failure post namespace creation for a child cloned with
CLONE_NEWPID leaks pid_namespace/mnt_cache due to proc being mounted
during creation, but not unmounted during cleanup. Call
pid_ns_release_proc() during cleanup.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding the function tracing event to perf had the side effect of
producing the following warning when enabling all events in
ftrace:
# echo 1 > /sys/kernel/debug/tracing/events/enable
[console]
event trace: Could not enable event function
This is because when enabling all events via the debugfs system
it ignores events that do not have a ->reg() function assigned.
This was to skip over the ftrace internal events (as they are
not TRACE_EVENTs). But as the ftrace function event now has
a ->reg() function attached to it for use with perf, it is no
longer ignored.
Worse yet, this ->reg() function is being called when it should
not be. It returns an error and causes the above warning to
be printed.
Add a new event_call flag (TRACE_EVENT_FL_IGNORE_ENABLE) and set it
on all ftrace internal event structures, so that setting
events/enable no longer incorrectly tries to enable the function
event and does not warn.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
compat_sys_sigprocmask reads a smaller signal mask from userspace than
sigprocmask accepts for setting. So the high word of blocked.sig[0]
will be cleared, releasing any potentially blocked RT signal.
This was discovered via userspace code that relies on get/setcontext.
glibc's i386 versions of those functions use sigprocmask instead of
rt_sigprocmask to save/restore the signal mask and caused RT signal
unblocking this way.
As suggested by Linus, this replaces the sys_sigprocmask based compat
version with one that open-codes the required logic, including the merge
of the existing blocked set with the new one provided on SIG_SETMASK.
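The merge can be modelled on a single 64-bit word. The sketch below is a
userspace illustration with hypothetical names (enum how,
compat_apply_mask()); it leaves out everything else the real syscall
does, such as copying from userspace and keeping SIGKILL/SIGSTOP
unblockable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for SIG_BLOCK, SIG_UNBLOCK and SIG_SETMASK. */
enum how { HOW_BLOCK, HOW_UNBLOCK, HOW_SETMASK };

/*
 * 'blocked' models blocked.sig[0] on a 64-bit kernel; 'set32' is the
 * 32-bit old-style mask supplied by the compat caller. Returns the new
 * 64-bit word, or the unchanged word for an unknown 'how'.
 */
static uint64_t compat_apply_mask(uint64_t blocked, enum how how,
				  uint32_t set32)
{
	const uint64_t low = 0xffffffffULL;

	switch (how) {
	case HOW_BLOCK:
		return blocked | set32;
	case HOW_UNBLOCK:
		return blocked & ~(uint64_t)set32;
	case HOW_SETMASK:
		/* Replace only the low word; keep the RT signals above it. */
		return (blocked & ~low) | set32;
	}
	return blocked;
}
```

A bit set in the high word, standing for a blocked RT signal, survives
all three operations in this model.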
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The output of the timestamps got lost with the conversion of the
kmsg buffer to records; restore the old behavior.
Document that CONFIG_PRINTK_TIME now only controls the output of
the timestamps in the syslog() system call and on the console, and
not the recording of the timestamps.
Cc: Joe Perches <joe@perches.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This prevents the merging of printk() continuation lines of different
threads, in the case they race against each other.
It should properly isolate "atomic" single-line printk() users from
continuation users, to make sure the single-line users will never be
merged with the racy continuation ones.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The rcu_barrier() primitive interrupts each and every CPU, registering
a callback on every CPU. Once all of these callbacks have been invoked,
rcu_barrier() knows that every callback that was registered before
the call to rcu_barrier() has also been invoked.
However, there is no point in registering a callback on a CPU that
currently has no callbacks, most especially if that CPU is in a
deep idle state. This commit therefore makes rcu_barrier() avoid
interrupting CPUs that have no callbacks. Doing this requires reworking
the handling of orphaned callbacks, otherwise callbacks could slip through
rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had
not yet interrupted to a CPU that rcu_barrier() had already interrupted.
This reworking was needed anyway to take a first step towards weaning
RCU from the CPU_DYING notifier's use of stop_cpu().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current initialization of the RCU_FAST_NO_HZ per-CPU variables makes
needless and fragile assumptions about the initial value of things like
the jiffies counter. This commit therefore explicitly initializes all of
them that are better started with a non-zero value. It also adds some
comments describing the per-CPU state variables.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
CPU goes offline, in which case it assumes that the CPU will have to come
out of dyntick-idle mode (cancelling the timer) in order to go offline.
This is important because when RCU_FAST_NO_HZ permits a CPU to enter
dyntick-idle mode despite having RCU callbacks pending, it posts a timer
on that CPU to force a wakeup on that CPU. This wakeup ensures that the
CPU will eventually handle the end of the grace period, including invoking
its RCU callbacks.
However, Pascal Chapperon's test setup shows that the timer handler
rcu_idle_gp_timer_func() really does get invoked in some cases. This is
problematic because this can cause the CPU that entered dyntick-idle
mode despite still having RCU callbacks pending to remain in
dyntick-idle mode indefinitely, which means that its RCU callbacks might
never be invoked. This situation can result in grace-period delays or
even system hangs, which matches Pascal's observations of slow boot-up
and shutdown (https://lkml.org/lkml/2012/4/5/142). See also the bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=806548
This commit therefore causes the "should never be invoked" timer handler
rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
the CPU for which the timer was intended, allowing that CPU to invoke
its RCU callbacks in a timely manner.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
We can easily use a single callback for both sched-in and sched-out. This
reduces the code footprint in the scheduler path as well as removes
the PMU black spot otherwise present between the out and in callback.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-o56ajxp1edwqg6x9d31wb805@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We always need to pass the last sample period to
perf_sample_data_init(), otherwise the event distribution will be
wrong. Thus, modify the function interface to take the required period
as an argument. So basically a pattern like this:
perf_sample_data_init(&data, ~0ULL);
data.period = event->hw.last_period;
now becomes:
perf_sample_data_init(&data, ~0ULL, event->hw.last_period);
This avoids an uninitialized data.period and simplifies the code.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current code groups up to 16 nodes in a level and then puts an
ALLNODES domain spanning the entire tree on top of that. This doesn't
reflect the numa topology and esp for the smaller not-fully-connected
machines out there today this might make a difference.
Therefore, build a proper numa topology based on node_distance().
Since there are no fixed numa layers anymore, the static SD_NODE_INIT
and SD_ALLNODES_INIT aren't usable anymore; the new code tries to
construct something similar and scales some values on the
number of cpus in the domain and/or the node_distance() ratio.
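One way to derive levels from node_distance() is to count the distinct
distance values. The sketch below assumes, for simplicity, that the
distances seen from node 0 cover every distance in the table; the table
itself and count_numa_levels() are hypothetical and only illustrate the
idea.

```c
#include <assert.h>

#define NR_NODES 4

/* Hypothetical distance table for a not-fully-connected 4-node machine. */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 30, 20 },
	{ 20, 30, 10, 20 },
	{ 30, 20, 20, 10 },
};

/* Count distinct distances seen from node 0; each is a candidate level. */
static int count_numa_levels(void)
{
	int seen[NR_NODES];
	int levels = 0;

	for (int j = 0; j < NR_NODES; j++) {
		int d = node_dist[0][j];
		int known = 0;

		for (int k = 0; k < levels; k++)
			if (seen[k] == d)
				known = 1;
		if (!known)
			seen[levels++] = d;
	}
	return levels;
}
```

For this table the result is three: the local distance plus two remote
distances, where a flat ALLNODES domain would lump both remote distances
together.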
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: linux-alpha@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: sparclinux@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: bob.picco@oracle.com
Cc: chris.mason@oracle.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the sched_domain walk is completely unserialized (!SD_SERIALIZE)
it is possible that multiple cpus in the group get elected to do the
next level. Avoid this by adding some serialization.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-vqh9ai6s0ewmeakjz80w4qz6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we let the leftmost (or first idle) cpu ascend the
sched_domain tree and perform load-balancing. The result is that the
busiest cpu in the group might be performing this function and pull
more load to itself. The next load balance pass will then try to
equalize this again.
Change this to pick the least loaded cpu to perform higher domain
balancing.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-v8zlrmgmkne3bkcy9dej1fvm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since there's a PID space limit of 30bits (see
futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a
lower bound of 2 pages per task) would still take 8T of memory it
seems reasonable to say that unsigned int is sufficient for
rq->nr_running.
When we do get anywhere near that amount of tasks I suspect other
things would go funny, load-balancer load computations would really
need to be hoisted to 128bit etc.
So save a few bytes and convert rq->nr_running and friends to
unsigned int.
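The 8T estimate above can be checked with a few lines of C. The 4 KiB
page size is an assumption; the other two constants come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define PID_BITS	30	/* PID space limit named above */
#define PAGES_PER_TASK	2	/* lower bound assumed above */
#define PAGE_BYTES	4096ULL	/* assumption: 4 KiB pages */

/* Minimum memory needed to fill the whole PID space with tasks. */
static uint64_t min_bytes_for_all_pids(void)
{
	return (1ULL << PID_BITS) * PAGES_PER_TASK * PAGE_BYTES;
}
```

2^30 tasks times 2 pages times 2^12 bytes is 2^43 bytes, i.e. 8 TiB,
and 2^30 fits comfortably in a 32-bit unsigned int.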
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If one cpu failed to boot and the boot cpu gave up on
waiting for it, and then another cpu is being booted, the kernel
might crash with the following OOPS:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
Call Trace:
[<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50
The crash happens in init_sched_groups_power(), which expects
sched_groups to be a circular linked list. However, this is not
always true, since sched_groups preallocated in __sdt_alloc are
initialized in build_sched_groups and it may exit early
if (cpu != cpumask_first(sched_domain_span(sd)))
return 0;
without initializing sd->groups->next field.
Fix the bug by initializing the next field right after the sched_group
is allocated.
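The pattern of the fix, sketched in userspace with a hypothetical,
stripped-down stand-in for struct sched_group: make every freshly
allocated element a valid one-element ring, so that a consumer walking
the ring never meets a NULL next pointer even if later setup is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, stripped-down stand-in for struct sched_group. */
struct group {
	struct group *next;
	int weight;
};

/* Allocate a group that is already a valid one-element circular list. */
static struct group *group_alloc(int weight)
{
	struct group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	g->next = g;	/* never leave next NULL, even if setup exits early */
	g->weight = weight;
	return g;
}

/* Walk the ring the way a consumer expecting a circular list would. */
static int ring_total_weight(const struct group *first)
{
	const struct group *g = first;
	int total = 0;

	do {
		total += g->weight;
		g = g->next;
	} while (g != first);
	return total;
}
```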
Also-Reported-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: a.p.zijlstra@chello.nl
Cc: pjt@google.com
Cc: seto.hidetoshi@jp.fujitsu.com
Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ftrace_disable_cpu() and ftrace_enable_cpu() functions were
needed back before the ring buffer was lockless. Now that the
ring buffer is lockless (and has been for some time), these functions
serve no purpose, and unnecessarily slow down operations of the tracer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It's appropriate to use __seq_open_private interface to open
some of trace seq files, because it covers all steps we are
duplicating in tracing code - zallocating the iterator and
setting it as seq_file's private.
Use this for the following files:
trace
available_filter_functions
enabled_functions
Link: http://lkml.kernel.org/r/1335342219-2782-5-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
[
Fixed warnings for:
kernel/trace/trace.c: In function '__tracing_open':
kernel/trace/trace.c:2418:11: warning: unused variable 'ret' [-Wunused-variable]
kernel/trace/trace.c:2417:19: warning: unused variable 'm' [-Wunused-variable]
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On Tue, May 8, 2012 at 10:02 AM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> kernel/built-in.o: In function `devkmsg_read':
> printk.c:(.text+0x27e8): undefined reference to `__udivdi3'
> Most probably the "msg->ts_nsec / 1000" since
> ts_nsec is a u64 and this is a 32 bit build ...
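For background: on 32-bit targets gcc turns a plain 64-bit division into
a call to the libgcc helper __udivdi3, which the kernel does not
provide; in-kernel code normally uses do_div() for dividing a u64 by a
32-bit value. The userspace sketch below shows a shift-subtract division
with the same "quotient in place, remainder returned" calling style.
div_u64_by_u32() is a hypothetical name and this is an illustration,
not the code of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Divide *n by base using only shifts, compares and subtractions.
 * The quotient is stored back in *n and the remainder is returned.
 * base must be nonzero.
 */
static uint32_t div_u64_by_u32(uint64_t *n, uint32_t base)
{
	uint64_t rem = 0;
	uint64_t quot = 0;

	for (int bit = 63; bit >= 0; bit--) {
		/* Bring down the next dividend bit, most significant first. */
		rem = (rem << 1) | ((*n >> bit) & 1);
		if (rem >= base) {
			rem -= base;
			quot |= 1ULL << bit;
		}
	}
	*n = quot;
	return (uint32_t)rem;
}
```

Converting a nanosecond timestamp to microseconds then becomes a call
with base 1000 instead of a "/" expression.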
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace __HAVE_ARCH_TASK_ALLOCATOR and __HAVE_ARCH_THREAD_ALLOCATOR
with proper config switches.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20120505150142.371309416@linutronix.de
Several architectures have their own kmemcache based thread allocator
because THREAD_SIZE is smaller than PAGE_SIZE. Add it to the core code
conditionally on THREAD_SIZE < PAGE_SIZE so the private copies can go.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.491002124@linutronix.de
These flags can be useful for extra allocations outside of the core
code.
Add __GFP_NOTRACK to them, so the archs which have kmemcheck do
not have to provide extra allocators just for that reason.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.428211694@linutronix.de
We error out when compiling with gcc4.1.[01] as it miscompiles
__weak. The workaround with magic defines is no longer
necessary. Make it __weak again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.306358267@linutronix.de
Will replace the misnamed cpu_idle_wait() function, which is copied a
gazillion times all over arch/*
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120507175652.049316594@linutronix.de
Support for multiple concurrent readers of /dev/kmsg, with read(),
seek(), poll() support. Output of message sequence numbers, to allow
userspace log consumers to reliably reconnect and reconstruct their
state at any given time. After open("/dev/kmsg"), read() always
returns *all* buffered records. If only future messages should be
read, SEEK_END can be used. In case records get overwritten while
/dev/kmsg is held open, or records get overwritten faster than they
are read, the next read() will return -EPIPE and the current reading
position gets updated to the next available record. The passed
sequence numbers allow the log consumer to calculate the number of
lost messages.
[root@mop ~]# cat /dev/kmsg
5,0,0;Linux version 3.4.0-rc1+ (kay@mop) (gcc version 4.7.0 20120315 ...
6,159,423091;ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
7,160,424069;pci_root PNP0A03:00: host bridge window [io 0x0000-0x0cf7] (ignored)
SUBSYSTEM=acpi
DEVICE=+acpi:PNP0A03:00
6,339,5140900;NET: Registered protocol family 10
30,340,5690716;udevd[80]: starting version 181
6,341,6081421;FDC 0 is a S82078B
6,345,6154686;microcode: CPU0 sig=0x623, pf=0x0, revision=0x0
7,346,6156968;sr 1:0:0:0: Attached scsi CD-ROM sr0
SUBSYSTEM=scsi
DEVICE=+scsi:1:0:0:0
6,347,6289375;microcode: CPU1 sig=0x623, pf=0x0, revision=0x0
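A consumer could pick the record prefix apart roughly as follows. This
sketch assumes the "prio,seq,timestamp;" layout visible in the output
above, the usual syslog encoding of the first field (facility in the
upper bits, level in the low 3 bits) and a timestamp in microseconds;
struct kmsg_prefix, kmsg_parse() and kmsg_lost() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Fields of the "prio,seq,timestamp;" prefix shown in the output above. */
struct kmsg_prefix {
	unsigned int level;		/* low 3 bits of the first field */
	unsigned int facility;		/* remaining bits of the first field */
	unsigned long long seq;		/* message sequence number */
	unsigned long long ts_usec;	/* timestamp, assumed microseconds */
	const char *text;		/* start of the message text */
};

/* Returns 0 on success, -1 if the line does not carry the prefix. */
static int kmsg_parse(const char *line, struct kmsg_prefix *out)
{
	unsigned int prio;
	int off = -1;

	/* %n only gets stored if the ';' separator matched as well. */
	if (sscanf(line, "%u,%llu,%llu;%n",
		   &prio, &out->seq, &out->ts_usec, &off) != 3 || off < 0)
		return -1;
	out->level = prio & 7;
	out->facility = prio >> 3;
	out->text = line + off;
	return 0;
}

/* Number of records lost between two consecutively read records. */
static unsigned long long kmsg_lost(unsigned long long prev_seq,
				    unsigned long long seq)
{
	return seq > prev_seq + 1 ? seq - prev_seq - 1 : 0;
}
```

After an -EPIPE, comparing the sequence number of the next record with
the last one seen gives the number of lost messages. Lines without the
prefix, such as the SUBSYSTEM= and DEVICE= lines, are rejected by
kmsg_parse() and can be treated as belonging to the previous record.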
Cc: Karel Zak <kzak@redhat.com>
Tested-by: William Douglas <william.douglas@intel.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- Record-based stream instead of the traditional byte stream
buffer. All records carry a 64 bit timestamp, the syslog facility
and priority in the record header.
- Records consume almost the same amount, sometimes less memory than
the traditional byte stream buffer (if printk_time is enabled). The record
header is 16 bytes long, plus some padding bytes at the end if needed.
The byte-stream buffer needed 3 chars for the syslog prefix, 15 char for
the timestamp and a newline.
- Buffer management is based on message sequence numbers. When records
need to be discarded, the reading heads move on to the next full
record. Unlike the byte-stream buffer, no old logged lines get
truncated or partly overwritten by new ones. Sequence numbers also
allow consumers of the log stream to get notified if any message in
the stream they are about to read gets discarded during the time
of reading.
- Better buffered IO support for KERN_CONT continuation lines, when printk()
is called multiple times for a single line. The use of KERN_CONT is now
mandatory to use continuation; a few places in the kernel need trivial fixes
here. The buffering could possibly be extended to per-cpu variables to allow
better thread-safety for multiple printk() invocations for a single line.
- Full-featured syslog facility value support. Different facilities
can tag their messages. All userspace-injected messages enforce a
facility value > 0 now, to be able to reliably distinguish them from
the kernel-generated messages. Independent subsystems like a
baseband processor running its own firmware, or a kernel-related
userspace process can use their own unique facility values. Multiple
independent log streams can co-exist that way in the same
buffer. All share the same global sequence number counter to ensure
proper ordering (and interleaving) and to allow the consumers of the
log to reliably correlate the events from different facilities.
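The 16-byte header and the trailing padding can be sketched like this.
The field names and the rounding helper are assumptions for
illustration; the text above only fixes the header size and says that
it carries a 64 bit timestamp plus facility and priority.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte header: 64-bit timestamp plus four 16-bit fields. */
struct rec_hdr {
	uint64_t ts_nsec;	/* timestamp */
	uint16_t len;		/* length of the whole record, padding included */
	uint16_t text_len;	/* length of the message text */
	uint16_t dict_len;	/* length of any attached key/value data */
	uint16_t level;		/* syslog priority and facility */
};

/* Size of a stored record: header plus payload, padded to header alignment. */
static size_t rec_size(size_t text_len, size_t dict_len)
{
	size_t align = alignof(struct rec_hdr);
	size_t size = sizeof(struct rec_hdr) + text_len + dict_len;

	return (size + align - 1) & ~(align - 1);
}
```

Compared with the byte stream, which spent 3 + 15 + 1 = 19 characters
per line on prefix, timestamp and newline when printk_time was enabled,
the 16-byte header plus padding comes out about the same or smaller.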
Tested-by: William Douglas <william.douglas@intel.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Several distros set this by default by patching panic_on_oops.
It seems to fit with the BOOTPARAM_{HARD,SOFT}_PANIC options
though, so let's add a Kconfig entry and reduce some more
upstream delta.
Signed-off-by: Kyle McMartin <kyle@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120411121529.GH26688@redacted.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Implements trace_event support for uprobes. In its current form
it can be used to put probes at a specified offset in a file and
dump the required registers when the code flow reaches the
probed address.
The following example shows how to dump the instruction pointer
and the %ax register at the probed text address. Here we are
trying to probe zfree in /bin/zsh:
# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
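The 0x46420 offset used in the example follows from the maps and objdump
output: symbol address minus mapping start plus the file offset of the
mapping. A small sketch of that arithmetic, with a hypothetical helper
name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate a virtual address inside a mapping into an offset within the
 * mapped file: subtract the mapping start, add the mapping's file offset.
 */
static uint64_t vaddr_to_file_offset(uint64_t vaddr, uint64_t map_start,
				     uint64_t map_pgoff)
{
	return vaddr - map_start + map_pgoff;
}
```

With zfree at 0x446420 and the text mapping starting at 0x400000 with
file offset 0, this gives 0x46420.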
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120411103043.GB29437@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the parts of trace_kprobe.c that can be shared with the upcoming
trace_uprobe.c into common code in kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c. There are no functional changes.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120409091144.8343.76218.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
is_delete and is_return can take at most 2 values and are better
off being booleans than ints. There are no functional changes.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120409091133.8343.65289.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The condition check in autosleep_store() is incorrect and prevents
/sys/power/autosleep from working as advertised. Fix that.
[rjw: Added the changelog.]
Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
All archs define init_task in the same way (except ia64, but there is
no particular reason why ia64 cannot use the common version). Create a
generic instance so all archs can be converted over.
The config switch is temporary and will be removed when all archs are
converted over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20120503085034.092585287@linutronix.de
I left 1 printk which uses __FILE__, __LINE__ explicitly, which should
not be subject to generic preferences expressed via pr_fmt().
+ tweaks suggested by Joe Perches:
- add doing to irq-enabled warning, like others. It won't happen often.
- change sysfs failure to crit, not just err, and make it 1 line in logs.
- coalesce 2 format fragments into 1 >80 char line
cc: Joe Perches <joe@perches.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit 9fb48c744: "params: add 3rd arg to option handler callback
signature", the if-guard added to the pr_debug was overzealous; no
callers pass NULL, and existing code above and below the guard assumes
as much. Change the if-guard to match, and silence the Smatch
complaint.
CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a wakeup interrupt has been disabled before the suspend code
disables all interrupts then we have to ignore the pending flag.
Otherwise we would abort suspend over and over as nothing clears the
pending flag because the interrupt is disabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: NeilBrown <neilb@suse.de>
Level triggered interrupts do not cause IRQS_PENDING to be set when
they fire while "disabled" as the 'pending' state is always present in
the level - they automatically refire when re-enabled.
However the IRQS_PENDING flag is also used to abort a suspend cycle -
if any 'is_wakeup_set' interrupt is PENDING, check_wakeup_irqs() will
cause suspend to abort. Without IRQS_PENDING, suspend won't abort.
Consequently, level-triggered interrupts that fire during the 'noirq'
phase of suspend do not currently abort suspend.
So set IRQS_PENDING even for level triggered interrupts, and make sure
to clear the flag in check_irq_resend.
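The effect on the suspend decision can be modelled in a few lines. This is a simplified sketch, not the kernel code: the Irq class, its fields and the mark_level_pending switch are hypothetical, with `pending` standing in for IRQS_PENDING.

```python
from dataclasses import dataclass


@dataclass
class Irq:
    wakeup: bool             # armed as a wakeup source
    level_triggered: bool
    pending: bool = False    # stands in for IRQS_PENDING


def fire_while_disabled(irq, mark_level_pending=True):
    """Record an interrupt that fires while its line is disabled."""
    if not irq.level_triggered or mark_level_pending:
        irq.pending = True


def suspend_must_abort(irqs):
    """Model of the check_wakeup_irqs() decision: a pending wakeup IRQ aborts."""
    return any(i.wakeup and i.pending for i in irqs)


irq = Irq(wakeup=True, level_triggered=True)
fire_while_disabled(irq, mark_level_pending=False)   # behaviour before the change
print(suspend_must_abort([irq]))                     # False
fire_while_disabled(irq)                             # behaviour after the change
print(suspend_must_abort([irq]))                     # True
```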
[ Changelog by courtesy of Neil ]
Tested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ
oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2
MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX
U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9
JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za
xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0=
=4d4w
-----END PGP SIGNATURE-----
Merge tag 'v3.4-rc5' into next
Linux 3.4-rc5
Merge to pull in prerequisite change for Smack:
86812bb0de
Requested by Casey.
percpu areas are already allocated during boot for each possible cpu.
percpu idle threads can be considered an extension of the percpu areas,
so allocate them for each possible cpu during boot as well.
This will eliminate the need for workqueue based idle thread allocation.
In future we can move the idle thread area into the percpu area too.
[ tglx: Moved the loop into smpboot.c and added an error check when
the init code failed to allocate an idle thread for a cpu which
should be onlined ]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: venki@google.com
Link: http://lkml.kernel.org/r/1334966930.28674.245.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Update the permission checks to use the new uid_eq and gid_eq helpers
and remove the now unnecessary user_ns equality comparison.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Convert setregid, setgid, setreuid, setuid,
setresuid, getresuid, setresgid, getresgid, setfsuid, setfsgid,
getuid, geteuid, getgid, getegid,
waitpid, waitid, wait4.
Convert userspace uids and gids into kuids and kgids before
being placed on struct cred. Convert struct cred kuids and
kgids into userspace uids and gids when returning them.
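The conversion in both directions can be pictured as a range lookup. The sketch below is a user-space model only: make_kuid, from_kuid and uid_eq are the names of the kernel helpers, but here they take a plain list of extents instead of a namespace pointer, the Extent layout and the example ranges are assumptions, and None stands in for an unmapped id.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    first: int         # first uid as seen inside the namespace
    lower_first: int   # first corresponding kernel-wide uid
    count: int         # number of ids in the range


def make_kuid(uid_map, uid):
    """Userspace uid -> kernel-wide kuid, or None if the uid is unmapped."""
    for e in uid_map:
        if e.first <= uid < e.first + e.count:
            return e.lower_first + (uid - e.first)
    return None


def from_kuid(uid_map, kuid):
    """Kernel-wide kuid -> uid as seen in the namespace, or None if unmapped."""
    for e in uid_map:
        if e.lower_first <= kuid < e.lower_first + e.count:
            return e.first + (kuid - e.lower_first)
    return None


def uid_eq(left, right):
    """kuids are global, so comparing them needs no namespace argument."""
    return left == right


init_ns = [Extent(0, 0, 2**32 - 1)]          # hypothetical identity mapping
container_ns = [Extent(0, 100000, 65536)]    # hypothetical container mapping

# uid 0 in the container is not the same user as uid 0 in the initial namespace.
print(uid_eq(make_kuid(init_ns, 0), make_kuid(container_ns, 0)))   # False
```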
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Compare kuids with uid_eq
- kuids are unique across all user namespaces so there is no longer the
need for a user_namespace comparison.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
These functions are no longer needed; replace them with their more useful equivalents.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
cred.h and a few trivial users of struct cred are changed. The rest of the users
of struct cred are left for other patches as there are too many changes to make
in one go and leave the change reviewable. If user namespaces and
CONFIG_UIDGID_STRICT_TYPE_CHECKS are both disabled, the code will continue to
compile and behave correctly.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
As a first step to converting struct cred to be all kuid_t and kgid_t
values, convert the group values stored in group_info to always be
kgid_t values. Unless user namespaces are used this change should
have no effect.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
When running preemptible RCU, if a task exits in an RCU read-side
critical section having blocked within that same RCU read-side critical
section, the task must be removed from the list of tasks blocking a
grace period (perhaps the current grace period, perhaps the next grace
period, depending on timing). The exit() path invokes exit_rcu() to
do this cleanup.
However, the current implementation of exit_rcu() needlessly does the
cleanup even if the task did not block within the current RCU read-side
critical section, which wastes time and needlessly increases the size
of the state space. Fix this by only doing the cleanup if the current
task is actually on the list of tasks blocking some grace period.
While we are at it, consolidate the two identical exit_rcu() functions
into a single function.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
kernel/rcupdate.c
Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler.
This is inefficient because enqueuing is required only if there is a
context switch, and entry to the scheduler does not guarantee a context
switch.
The commit therefore moves the enqueuing to immediately precede the
call to switch_to() from the scheduler.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done to resolve a merge issue with the init/main.c file.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Android allows user space to manipulate wakelocks using two
sysfs files located in /sys/power/, wake_lock and wake_unlock.
Writing a wakelock name and optionally a timeout to the wake_lock
file causes the wakelock whose name was written to be acquired (it
is created first if necessary), optionally with the given timeout.
Writing the name of a wakelock to wake_unlock causes that wakelock
to be released.
Implement an analogous interface for user space using wakeup sources.
Add the /sys/power/wake_lock and /sys/power/wake_unlock files
allowing user space to create, activate and deactivate wakeup
sources, such that writing a name and optionally a timeout to
wake_lock causes the wakeup source of that name to be activated,
optionally with the given timeout. If that wakeup source doesn't
exist, it will be created and then activated. Writing a name to
wake_unlock causes the wakeup source of that name, if there is one,
to be deactivated. Wakeup sources created with the help of
wake_lock that haven't been used for more than 5 minutes are garbage
collected and destroyed. Moreover, there can be only WL_NUMBER_LIMIT
wakeup sources created with the help of wake_lock present at a time.
The data type used to track wakeup sources created by user space is
called "struct wakelock" to indicate the origins of this feature.
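The behaviour described above can be modelled in user space as follows. This is only an illustrative sketch: the WakeLocks class, its methods, the use of seconds for timeouts and the small WL_NUMBER_LIMIT value are assumptions made for the example, and the caller supplies the current time explicitly.

```python
WL_NUMBER_LIMIT = 3        # illustrative value; the real limit is set by the kernel
WL_GC_SECONDS = 5 * 60     # "not used for more than 5 minutes"


class WakeLocks:
    def __init__(self):
        self.locks = {}    # name -> {"active", "expires", "last_used"}

    def wake_lock(self, name, now, timeout=None):
        """Activate the named source, creating it first if it does not exist."""
        lock = self.locks.get(name)
        if lock is None:
            if len(self.locks) >= WL_NUMBER_LIMIT:
                raise OSError("too many user-space wakeup sources")
            lock = self.locks[name] = {}
        expires = None if timeout is None else now + timeout
        lock.update(active=True, expires=expires, last_used=now)

    def wake_unlock(self, name, now):
        """Deactivate the named source if there is one; report whether it existed."""
        lock = self.locks.get(name)
        if lock is None:
            return False
        lock.update(active=False, expires=None, last_used=now)
        return True

    def is_active(self, name, now):
        lock = self.locks.get(name)
        if lock is None or not lock["active"]:
            return False
        return lock["expires"] is None or now < lock["expires"]

    def gc(self, now):
        """Destroy inactive sources that have been unused for too long."""
        for name in list(self.locks):
            idle = now - self.locks[name]["last_used"]
            if not self.is_active(name, now) and idle > WL_GC_SECONDS:
                del self.locks[name]
```

For example, a source activated at time 0 with a 10 second timeout is active at time 5, inactive at time 10, and removed by a gc() call at time 400.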
This version of the patch includes an rbtree manipulation fix from John Stultz.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Android uses one wakelock statistic that is only necessary for
opportunistic sleep. Namely, the prevent_suspend_time field
accumulates the total time the given wakelock has been locked
while "automatic suspend" was enabled. Add an analogous field,
prevent_sleep_time, to wakeup sources and make it behave in a similar
way.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce a mechanism by which the kernel can trigger global
transitions to a sleep state chosen by user space if there are no
active wakeup sources.
It consists of a new sysfs attribute, /sys/power/autosleep, that
can be written one of the strings returned by reads from
/sys/power/state, an ordered workqueue and a work item carrying out
the "suspend" operations. If a string representing the system's
sleep state is written to /sys/power/autosleep, the work item
triggering transitions to that state is queued up and it requeues
itself after every execution until user space writes "off" to
/sys/power/autosleep.
That work item enables the detection of wakeup events using the
functions already defined in drivers/base/power/wakeup.c (with one
small modification) and calls either pm_suspend(), or hibernate() to
put the system into a sleep state. If a wakeup event is reported
while the transition is in progress, it will abort the transition and
the "system suspend" work item will be queued up again.
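One execution of that work item can be summarised as a small decision function. The sketch below is a model under stated assumptions, not the kernel implementation: autosleep_work and its return tuple are hypothetical, and "disk" is assumed to be the string that selects hibernation.

```python
def autosleep_work(state, wakeup_event_during_transition):
    """Model one run of the 'suspend' work item.

    state -- last string written to /sys/power/autosleep
    Returns (operation, completed, requeue).
    """
    if state == "off":
        # Nothing to do and nothing to queue until user space writes a state.
        return (None, False, False)
    operation = "hibernate" if state == "disk" else "pm_suspend"
    # A wakeup event reported during the transition aborts it.
    completed = not wakeup_event_during_transition
    # The item requeues itself after every execution while autosleep is on.
    return (operation, completed, True)


print(autosleep_work("mem", wakeup_event_during_transition=True))
# ('pm_suspend', False, True)
```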
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NeilBrown <neilb@suse.de>
1. Do not allocate memory for buffers from emergency pools, unless
absolutely required. Do not warn about and do not retry non-essential
failed allocations.
2. Do not check the amount of free pages left on every single page
write, but wait until one map is completely populated and then check.
3. Set maximum number of pages for read buffering consistently, instead
of inadvertently depending on the size of the sector type.
4. Fix copyright line, which I missed when I submitted the hibernation
threading patch.
5. Dispense with bit shifting arithmetic to improve readability.
6. Really recalculate the number of pages required to be free after all
allocations have been done.
7. Fix calculation of pages required for read buffering. Only count in
pages that do not belong to high memory.
Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Timers are subject to migration, which can lead to the following
system-hang scenario when CONFIG_RCU_FAST_NO_HZ=y:
1. CPU 0 executes synchronize_rcu(), which posts an RCU callback.
2. CPU 0 then goes idle. It cannot immediately invoke the callback,
but there is nothing RCU needs from it, so it enters dyntick-idle
mode after posting a timer.
3. The timer gets migrated to CPU 1.
4. CPU 0 never wakes up, so the synchronize_rcu() never returns, so
the system hangs.
This commit fixes this problem by using mod_timer_pinned(), as suggested
by Peter Zijlstra, to ensure that the timer is actually posted on the
running CPU.
Reported-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Merge tag 'v3.4-rc5' into for-3.5/core
The core branch is behind driver commits that we want to build
on for 3.5, hence I'm pulling in a later -rc.
Linux 3.4-rc5
Conflicts:
Documentation/feature-removal-schedule.txt
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This introduces a fake module param $module.dyndbg. It's based upon
Thomas Renninger's $module.ddebug boot-time debugging patch from
https://lkml.org/lkml/2010/9/15/397
The 'fake' module parameter is provided for all modules, whether or
not they need it. It is not explicitly added to each module, but is
implemented in callbacks invoked from parse_args.
For builtin modules, dynamic_debug_init() now directly calls
parse_args(..., &ddebug_dyndbg_boot_params_cb), to process the params
undeclared in the modules, just after the ddebug tables are processed.
While it's slightly weird to reprocess the boot params, parse_args() is
already called repeatedly by do_initcall_levels(). More importantly,
the dyndbg queries (given in ddebug_query or dyndbg params) cannot be
activated until after the ddebug tables are ready, and reusing
parse_args is cleaner than doing an ad-hoc parse. This reparse would
break options like inc_verbosity, but they probably should be params,
like verbosity=3.
ddebug_dyndbg_boot_params_cb() handles both bare dyndbg (aka:
ddebug_query) and module-prefixed dyndbg params, and ignores all other
parameters. For example, the following will enable pr_debug()s in 4
builtin modules, in the order given:
dyndbg="module params +p; module aio +p" module.dyndbg=+p pci.dyndbg
For loadable modules, parse_args() in load_module() calls
ddebug_dyndbg_module_params_cb(). This handles bare dyndbg params as
passed from modprobe, and errors on other unknown params.
Note that modprobe reads /proc/cmdline, so "modprobe foo" grabs all
foo.params, strips the "foo.", and passes these to the kernel.
ddebug_dyndbg_module_params_cb() is again called for the unknown
params; it handles dyndbg, and errors on others. The "doing" arg
added previously contains the module name.
For non CONFIG_DYNAMIC_DEBUG builds, the stub function accepts
and ignores $module.dyndbg params, other unknowns get -ENOENT.
If no param value is given (as in pci.dyndbg example above), "+p" is
assumed, which enables all pr_debug callsites in the module.
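The parameter handling described above can be sketched as a single function. This is a simplified model: dyndbg_param is a hypothetical name, it returns the (module, query) pair instead of executing the query, and None stands in for the error or "ignore" result of the real callbacks.

```python
def dyndbg_param(param, val, modname=None):
    """Return (module, query) for a dyndbg parameter, or None for anything else.

    modname is None at boot time, where a "module." prefix may be present;
    for a loadable module it is the module name and the param must be bare.
    """
    if modname is None and "." in param:
        modname, param = param.split(".", 1)
    if param != "dyndbg":
        return None
    # No value given: "+p" is assumed, enabling every callsite in the module.
    return (modname, val if val is not None else "+p")


# The three parameters from the boot-line example above.
print(dyndbg_param("dyndbg", "module params +p; module aio +p"))
print(dyndbg_param("module.dyndbg", "+p"))
print(dyndbg_param("pci.dyndbg", None))      # ('pci', '+p')
```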
The dyndbg fake parameter is not shown in /sys/module/*/parameters,
thus it does not use any resources. Changes to it are made via the
control file.
Also change pr_info in ddebug_exec_queries to vpr_info,
no need to see it all the time.
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
CC: Thomas Renninger <trenn@suse.de>
CC: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a 3rd arg, named "doing", to unknown-options callbacks invoked
from parse_args(). The arg is passed as:
"Booting kernel" from start_kernel(),
initcall_level_names[i] from do_initcall_level(),
mod->name from load_module(), via parse_args(), parse_one()
parse_args() already has the "name" parameter, which is renamed to
"doing" to better reflect current uses 1,2 above. parse_args() passes
it to an altered parse_one(), which now passes it down into the
unknown option handler callbacks.
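As a rough illustration of the new callback shape, the sketch below passes "doing" through to the unknown-option handler. It is a toy model, not the kernel parser: it splits on whitespace only (no quoted values), and the `known` dictionary is an assumption of the example.

```python
def parse_args(doing, cmdline, known, unknown):
    """Split 'cmdline' on whitespace and dispatch each param[=val].

    known   -- dict mapping parameter names to setter callables
    unknown -- callback taking (param, val, doing), as in the new signature
    """
    for arg in cmdline.split():
        param, sep, val = arg.partition("=")
        val = val if sep else None      # bare parameter: no value given
        if param in known:
            known[param](val)
        else:
            unknown(param, val, doing)


seen = []
parse_args("Booting kernel", "quiet foo.bar=1",
           known={"quiet": lambda val: None},
           unknown=lambda param, val, doing: seen.append((doing, param, val)))
print(seen)   # [('Booting kernel', 'foo.bar', '1')]
```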
The mod->name will be needed to handle dyndbg for loadable modules,
since params passed by modprobe are not qualified (they do not have a
"$modname." prefix), and by the time the unknown-param callback is
called, the module name is not otherwise available.
Minor tweaks:
Add param-name to parse_one's pr_debug(); the current message doesn't
identify the param being handled.
Add a pr_info to print current level and level_name of the initcall,
and number of registered initcalls at that level. This adds 7 lines
to dmesg output, like:
initlevel:6=device, 172 registered initcalls
Drop "parameters" from initcall_level_names[], it's unhelpful in the
pr_info() added above. This array is passed into parse_args() by
do_initcall_level().
CC: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add srcu_torture_deferred_free() for srcu_ops so as to test the new
call_srcu(). Rename the original srcu_ops to srcu_sync_ops.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit implements an SRCU state machine in support of call_srcu().
The state machine is preemptible, light-weight, and single-threaded,
minimizing synchronization overhead. In particular, there is no longer
any need for synchronize_srcu() to be guarded by a mutex.
Expedited processing is handled, at least in the absence of concurrent
grace-period operations on that same srcu_struct structure, by having
the synchronize_srcu_expedited() thread take on the role of the
workqueue thread for one iteration.
There is a reasonable probability that a given SRCU callback will
be invoked on the same CPU that registered it; however, there is no
guarantee. Concurrent SRCU grace-period primitives can cause callbacks
to be executed elsewhere, even in absence of CPU-hotplug operations.
Callbacks execute in process context, but under the influence of
local_bh_disable(), so it is illegal to sleep in an SRCU callback
function.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The earlier algorithm used an "expedited" flag combined with a "trycount"
counter to differentiate between normal and expedited SRCU grace periods.
However, the difference can be encoded into a single counter with a cutoff
value and different initial values for expedited and normal SRCU grace
periods. This commit makes that change.
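The idea can be shown with a toy model in which the waiter spins while the counter is above the cutoff and sleeps afterwards. The constants and the function below are illustrative assumptions, not values or names taken from the patch.

```python
NORMAL_TRYCOUNT = 2        # illustrative initial value for a normal grace period
EXPEDITED_TRYCOUNT = 12    # illustrative initial value for an expedited one
CUTOFF = 0                 # at or below this, the waiter sleeps instead of spinning


def wait_actions(failed_checks, trycount):
    """List what the waiter does after each failed check for active readers."""
    actions = []
    for _ in range(failed_checks):
        if trycount > CUTOFF:
            trycount -= 1
            actions.append("spin")
        else:
            actions.append("sleep")
    return actions


print(wait_actions(4, NORMAL_TRYCOUNT))      # ['spin', 'spin', 'sleep', 'sleep']
print(wait_actions(4, EXPEDITED_TRYCOUNT))   # ['spin', 'spin', 'spin', 'spin']
```

A single integer is enough because the only difference between the two kinds of grace period is how long the waiter keeps spinning before it starts to sleep.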
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Conflicts:
kernel/srcu.c
Expand the calls to srcu_readers_active_idx() from srcu_readers_active()
inline. This change improves cache locality by iterating over the CPUs
once rather than twice.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit implements a variant of Peter's algorithm, which may be found
at https://lkml.org/lkml/2012/2/1/119.
o Make the checking lock-free to enable parallel checking.
Parallel checking is required when (1) the original checking
task is preempted for a long time, (2) synchronize_srcu_expedited()
starts during an ongoing SRCU grace period, or (3) we wish to
avoid acquiring a lock.
o Since the checking is lock-free, we avoid a mutex in state machine
for call_srcu().
o Remove the SRCU_REF_MASK and remove the coupling with the flipping.
This might allow us to remove the preempt_disable() in future
versions, though such removal will need great care because it
rescinds the one-old-reader-per-CPU guarantee.
o Remove a smp_mb(), simplify the comments and make the smp_mb() pairs
more intuitive.
Inspired-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The safety of SRCU is provided by wait_idx() rather than flipping.
The flipping actually prevents starvation.
This commit therefore updates the comments to more accurately and
precisely describe what is going on.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This is an optimization of the SRCU grace period. To guard against
preempted readers with old values of the counter, it suffices to scan the
old counters once more, then flip ->completed only one time. The reason
this works is that the old readers must have incremented the old set of
counters (if they have not yet incremented, then their critical section
starts after this grace period, so they may be safely ignored).
This commit therefore optimizes the second flip out in favor of a simple
rescan.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The purpose of the upper bit of SRCU's per-CPU counters is to guarantee
that no reasonable series of srcu_read_lock() and srcu_read_unlock()
operations can return the value of the counter to its original value.
This guarantee is required only after the index has been switched to
the other set of counters, so at most one srcu_read_lock() can affect
a given CPU's counter. The number of srcu_read_unlock() operations
on a given counter is limited to the number of tasks in the system,
which given the Linux kernel's current structure is limited to far less
than 2^30 on 32-bit systems and far less than 2^62 on 64-bit systems.
(Something about a limited number of bytes in the kernel's address space.)
Therefore, if srcu_read_lock() increments the upper bits, then
srcu_read_unlock() need not do so. In this case, an srcu_read_lock() and
an srcu_read_unlock() will flip the lower bit of the upper field of the
counter. An unreasonably large additional number of srcu_read_unlock()
operations would be required to return the counter to its initial value,
thus preserving the guarantee.
This commit takes this approach, which further allows it to shrink
the size of the upper field to one bit, making the number of
srcu_read_unlock() operations required to return the counter to its
initial value even more unreasonable than before.
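The counter arithmetic can be checked with a 32-bit model. This is a sketch under assumptions: the function names are hypothetical, a single counter stands in for the per-CPU counters, and the upper field is taken to be the top bit of a 32-bit word.

```python
BITS = 32
MASK = (1 << BITS) - 1
USAGE_COUNT = 1 << (BITS - 1)    # the one-bit upper field


def read_lock_update(counter):
    """Reader entry: bump the reference count and the upper field together."""
    return (counter + USAGE_COUNT + 1) & MASK


def read_unlock_update(counter):
    """Reader exit: drop only the reference count."""
    return (counter - 1) & MASK


start = 0
after_pair = read_unlock_update(read_lock_update(start))
print(hex(after_pair))                   # 0x80000000

# Extra unlock operations needed to bring the counter back to 'start'.
print((after_pair - start) & MASK)       # 2147483648
```

One lock/unlock pair leaves the reference count unchanged but flips the upper bit, so only 2^31 further unlock operations could restore the original value in this 32-bit model.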
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The fastpath in __synchronize_srcu() is designed to handle cases where
there are a large number of concurrent calls for the same srcu_struct
structure. However, the Linux kernel currently does not use SRCU in
this manner, so remove the fastpath checks for simplicity.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current implementation of synchronize_srcu_expedited() can cause
severe OS jitter due to its use of synchronize_sched(), which in turn
invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
This can result in severe performance degradation for real-time workloads
and especially for short-iteration-length HPC workloads. Furthermore,
because only one instance of try_stop_cpus() can be making forward progress
at a given time, only one instance of synchronize_srcu_expedited() can
make forward progress at a time, even if they are all operating on
distinct srcu_struct structures.
This commit, inspired by an earlier implementation by Peter Zijlstra
(https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
takes a strictly algorithmic bits-in-memory approach. This has the
disadvantage of requiring one explicit memory-barrier instruction in
each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
completely dispenses with OS jitter and furthermore allows SRCU to be
used freely by CPUs that RCU believes to be idle or offline.
The update-side implementation handles the single read-side memory
barrier by rechecking the per-CPU counters after summing them and
by running through the update-side state machine twice.
This implementation has passed moderate rcutorture testing on both
x86 and Power. Also updated to use this_cpu_ptr() instead of per_cpu_ptr(),
as suggested by Peter Zijlstra.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Although rcutorture does invoke rcu_barrier() and friends, it cannot
really be called a torture test given that it invokes them only once
at the end of the test. This commit therefore introduces heavy-duty
rcutorture testing for rcu_barrier(), which may be carried out
concurrently with normal rcutorture testing.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Fix for an issue causing hibernation to hang on systems with highmem (that
practically means i386) due to broken memory management (bug introduced in 3.2,
so -stable material) and PM documentation update making the freezer
documentation follow the code again after some recent updates.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJPnbeLAAoJEKhOf7ml8uNs4nkQAKwhfWWfbM7ZepPfT56A5NW1
9vlfgO+1ibUgdjkO0hi1biCAbARTNVS5eCLyRJW0W/msGgL51nYleBeFmwwx5E6h
m5Vwr/53cGeAeF0AOkrQkD45YKJaAlTmWF4/T2YWKLWNMgaVuLmGf7eyYZ6rP1NO
rJxzUOMC6UrRjIA+S2anDU0CdMyqDHvV3OmY+InZBikFCk0YAtDYUYfNDNqQpEBG
bzkG3SyaJeqnbQDkhme7U/uAPJCThSz2Z4gvvOxiXdB+I+yp6FhluhLSGxqMh/kj
OUAJe9s6AAdKz+K62/OgowwucxvmeJRCyYWkN2ZEpsZLoqTEOqLNS4+eaUO6xS/2
tq89LnfSIVFwRx23XeVr/oMfxUJZ8VKZENo5Pm6NjTAYykTeyD4ug/GAHqgXR0TT
B+fvx8QmQ68R843aJsjR9h0AKsSeXfgCAROJt+x0ONYAvmJNV62nzs81broDEl4I
BmWHpOWI7wlzMPt7bNWn4ev4K+WhbVsioFDS61he0Y47Rqt3yUJ8G2OfBq6JYndw
As4ImoPOVGl0+TKcHJ9Y3bVPnsY7fJyF0GeG50NHxsVFsnTv+rYZ9K3GM9KExhO/
5mCfoHNgkOJnhGHfZppbnQBHbmjH8EA3QUx57Abo+Q4wiPNNAVG9P1JpZeyGj8KF
3YML5FjjGQHtYBWeH5WR
=HaVL
-----END PGP SIGNATURE-----
Merge tag 'pm-for-3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael J. Wysocki:
"Fix for an issue causing hibernation to hang on systems with highmem
(that practically means i386) due to broken memory management (bug
introduced in 3.2, so -stable material) and PM documentation update
making the freezer documentation follow the code again after some
recent updates."
* tag 'pm-for-3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / Freezer / Docs: Update documentation about freezing of tasks
PM / Hibernate: fix the number of pages used for hibernate/thaw buffering
Pull perf fixes from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix perf_event_for_each() to use sibling
perf symbols: Read plt symbols from proper symtab_type binary
tracing: Fix stacktrace of latency tracers (irqsoff and friends)
perf tools: Add 'G' and 'H' modifiers to event parsing
tracing: Fix regression with tracing_on
perf tools: Drop CROSS_COMPILE from flex and bison calls
perf report: Fix crash showing warning related to kernel maps
tracing: Fix build breakage without CONFIG_PERF_EVENTS (again)
Pull build fixes for less mainstream architectures from Paul Gortmaker:
"These are fixes for frv(1), blackfin(2), powerpc(1) and xtensa(4).
Fortunately the touches are nearly all specific to files just used by
the arch in question. The two touches to shared/common files
[kernel/irq/debug.h and drivers/pci/Makefile] are trivial to assess as
no risk to anyone.
Half of them relate to xtensa directly. It was only when I fixed the
last xtensa issue that I realized that the arch has been broken for a
significant time, and isn't a specific v3.4 regression. So if you
wanted, we could leave xtensa lying bleeding in the street for a
couple more weeks and queue those for 3.5. But given they are no risk
to anyone outside of xtensa, I figured to just leave them in.
If you are OK with taking the xtensa fixes, then please pull to get:
- one last implicit include uncovered by system.h that is in a file
specific to just one powerpc defconfig. (I'd sync'd with BenH).
- fix an oversight in the PCI makefile where shared code wasn't being
compiled for ARCH=frv
- fix a missing include for GPIO in blackfin framebuffer.
- audit and tag endif in blackfin ezkit board file, in order to find
and fix the misplaced endif masking a block of code.
- fix irq/debug.h choice of temporary macro names to be more internal
so they don't conflict with names used by xtensa.
- fix a reference to an undeclared local var in xtensa's signal.c
- fix an implicit bug.h usage in xtensa's asm/io.h uncovered by my
removing bug.h from kernel.h
- fix xtensa to properly indicate it is using asm-generic/hardirq.h
in order to resolve the link error - undefined ack_bad_irq
The xtensa still fails final link as my latest binutils does something
evil when ld forward-relocates unlikely() blocks, but in theory people
who have older/valid toolchains could now use the thing."
* 'for-v3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
xtensa: fix build fail on undefined ack_bad_irq
blackfin: fix ifdef fustercluck in mach-bf538/boards/ezkit.c
blackfin: fix compile error in bfin-lq035q1-fb.c
pci: frv architecture needs generic setup-bus infrastructure
irq: hide debug macros so they don't collide with others.
xtensa: fix build error in xtensa/include/asm/io.h
xtensa: fix build failure in xtensa/kernel/signal.c
powerpc: fix system.h fallout in sysdev/scom.c [chroma_defconfig]
Updating max_usage is something one would expect when we reach
a new maximum usage value even when we do this by forcing through
the limit with res_counter_charge_nofail().
(Whether we want to account failcnt when we force through the limit
is another debate).
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Li Zefan <lizefan@huawei.com>
These two functions do almost the same thing and duplicate some code.
Merge their implementation into a single common function.
res_counter_charge_locked() takes one more parameter but it doesn't seem
to be used outside res_counter.c yet anyway.
There is no (intended) change in the behaviour.
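A minimal userspace sketch of what such a merged charge path could look like, assuming a simplified counter. The struct, the function name and the choice to bump failcnt on a forced charge are illustrative assumptions, not the code from res_counter.c:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a resource counter. */
struct counter {
	unsigned long usage;
	unsigned long max_usage;
	unsigned long limit;
	unsigned long failcnt;
};

/*
 * One charge path for both variants. With force set, the charge is
 * applied even when it exceeds the limit, and max_usage still follows.
 * Counting failcnt in the forced case is an assumption of this sketch.
 */
int counter_charge(struct counter *c, unsigned long val, bool force)
{
	int ret = 0;

	if (c->usage + val > c->limit) {
		c->failcnt++;
		ret = -ENOMEM;
		if (!force)
			return ret;
	}

	c->usage += val;
	if (c->usage > c->max_usage)
		c->max_usage = c->usage;
	return ret;
}
```

A caller that must not fail passes force = true and can still inspect the return value to learn that the limit was exceeded.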
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Li Zefan <lizefan@huawei.com>
The mod_timer_pinned() header comment states that it prevents timers
from being migrated to a different CPU. This is not the case, instead,
it ensures that the timer is posted to the current CPU, but does nothing
to prevent CPU-hotplug operations from migrating the timer.
This commit therefore brings the comment header into alignment with
reality.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
RCU_FAST_NO_HZ uses a timer to limit the time that a CPU with callbacks
can remain in dyntick-idle mode. This timer is cancelled when the CPU
exits idle, and therefore should never fire. However, if the timer
were migrated to some other CPU for whatever reason (1) the timer could
actually fire and (2) firing on some other CPU would fail to wake up the
CPU with callbacks, possibly resulting in sluggishness or a system hang.
This commit therefore adds a WARN_ON_ONCE() to the timer handler in order
to detect this condition.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In perf_event_for_each() we call a function on an event, and then
iterate over the siblings of the event.
However we don't call the function on the siblings, we call it
repeatedly on the original event - it seems "obvious" that we should
be calling it with sibling as the argument.
It looks like this broke in commit 75f937f24b ("Fix ctx->mutex
vs counter->mutex inversion").
The only effect of the bug is that the PERF_IOC_FLAG_GROUP parameter
to the ioctls doesn't work.
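The intended iteration can be sketched in userspace as follows. The struct layout and all names here are hypothetical and much simpler than the real perf_event; the point is only that the callback receives the sibling, not the original event:

```c
#include <stddef.h>

/* Hypothetical, simplified event: a group leader plus an array of siblings. */
struct event {
	int calls;                 /* how often func ran on this event */
	struct event **siblings;
	size_t nr_siblings;
};

/* Example callback: records that it was invoked on the given event. */
void count_call(struct event *e)
{
	e->calls++;
}

/* Call func on the leader, then once on each sibling. */
void event_for_each(struct event *leader, void (*func)(struct event *))
{
	func(leader);
	for (size_t i = 0; i < leader->nr_siblings; i++)
		func(leader->siblings[i]);   /* the buggy form passed leader here */
}
```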
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334109253-31329-1-git-send-email-michael@ellerman.id.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Under extreme memory pressure, percpu allocation
might fail. We hit this when the system goes to suspend-to-ram,
causing a kworker panic:
EIP: [<c124411a>] build_sched_domains+0x23a/0xad0
Kernel panic - not syncing: Fatal exception
Pid: 3026, comm: kworker/u:3
3.0.8-137473-gf42fbef #1
Call Trace:
[<c18cc4f2>] panic+0x66/0x16c
[...]
[<c1244c37>] partition_sched_domains+0x287/0x4b0
[<c12a77be>] cpuset_update_active_cpus+0x1fe/0x210
[<c123712d>] cpuset_cpu_inactive+0x1d/0x30
[...]
With this fix applied build_sched_domains() will return -ENOMEM and
the suspend attempt fails.
Signed-off-by: he, bo <bo.he@intel.com>
Reviewed-by: Zhang, Yanmin <yanmin.zhang@intel.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1335355161.5892.17.camel@hebo
[ So, we fail to deallocate a CPU because we cannot allocate RAM :-/
I don't like that kind of sad behavior but nevertheless it should
not crash under high memory load. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commits 367456c756 ("sched: Ditch per cgroup task lists for
load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
left some more wreckage.
By setting loop_max unconditionally to ->nr_running load-balancing
could take a lot of time on very long runqueues (hackbench!). So keep
the sysctl as max limit of the amount of tasks we'll iterate.
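As a rough sketch of that bound (the function and parameter names are hypothetical, with nr_migrate_limit standing in for the sysctl value):

```c
/* Bound the number of tasks one balancing pass will iterate over. */
unsigned int balance_loop_max(unsigned int nr_migrate_limit,
			      unsigned int nr_running)
{
	return nr_running < nr_migrate_limit ? nr_running : nr_migrate_limit;
}
```

With a limit of 32, a runqueue holding 10000 tasks is scanned for at most 32 of them per pass, while a short runqueue is still scanned in full.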
Furthermore, the min load filter for migration completely fails with
cgroups since inequality in per-cpu state can easily lead to such
small loads :/
Furthermore the change to add new tasks to the tail of the queue
instead of the head seems to have some effect.. not quite sure I
understand why.
Combined these fixes solve the huge hackbench regression reported by
Tim when hackbench is run in a cgroup.
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
[ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All SMP architectures have magic to fork the idle task and to store it
for reusage when cpu hotplug is enabled. Provide a generic
infrastructure for it.
Create/reinit the idle thread for the cpu which is brought up in the
generic code and hand the thread pointer to the architecture code via
__cpu_up().
Note, that fork_idle() is called via a workqueue, because this
guarantees that the idle thread does not get a reference to a user
space VM. This can happen when the boot process did not bring up all
possible cpus and a later cpu_up() is initiated via the sysfs
interface. In that case fork_idle() would be called in the context of
the user space task and take a reference on the user space VM.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: x86@kernel.org
Acked-by: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/r/20120420124557.102478630@linutronix.de
Start a new file, which will hold SMP and CPU hotplug related generic
infrastructure.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20120420124557.035417523@linutronix.de
Preparatory patch to make the idle thread allocation for secondary
cpus generic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20120420124556.964170564@linutronix.de
- Convert the old uid mapping functions into compatibility wrappers
- Add a uid/gid mapping layer from user space uid and gids to kernel
internal uids and gids that is extent based for simplicity and speed.
* Working with number space after mapping uids/gids into their kernel
internal version adds only mapping complexity over what we have today,
leaving the kernel code easy to understand and test.
- Add proc files /proc/self/uid_map /proc/self/gid_map
These files display the mapping and allow a mapping to be added
if a mapping does not exist.
- Allow entering the user namespace without a uid or gid mapping.
Since we are starting with an existing user, our uids and gids
still have global mappings, so they are still valid and useful; they just
don't have local mappings. What things require in order to work is a global
uid and gid, so it is odd but perfectly fine not to have a local uid
and gid mapping.
Not requiring global uid and gid mappings greatly simplifies
the logic of setting up the uid and gid mappings by allowing
the mappings to be set after the namespace is created which makes the
slight weirdness worth it.
- Make the mappings in the initial user namespace to the global
uid/gid space explicit. Today it is an identity mapping
but in the future we may want to twist this for debugging, similar
to what we do with jiffies.
- Document the memory ordering requirements of setting the uid and
gid mappings. We only allow the mappings to be set once
and there are no pointers involved so the requirements are
trivial but a little atypical.
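An extent-based lookup of the kind described in the list above can be sketched in userspace like this. The struct, its field names and the ID_UNMAPPED marker are assumptions made for illustration, not a copy of the kernel's data structures:

```c
#include <stddef.h>
#include <stdint.h>

/* One extent: ids [first, first + count) map to
 * [lower_first, lower_first + count). */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

#define ID_UNMAPPED UINT32_MAX  /* hypothetical "no mapping" marker */

/* Translate an id through the extent table; linear scan, no pointers. */
uint32_t map_id_down(const struct id_extent *map, size_t nr, uint32_t id)
{
	for (size_t i = 0; i < nr; i++) {
		uint32_t first = map[i].first;

		if (id >= first && id - first < map[i].count)
			return map[i].lower_first + (id - first);
	}
	return ID_UNMAPPED;
}
```

Because each extent covers a contiguous range, a handful of entries can describe a large id space and the lookup is a few comparisons per extent.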
Performance:
In this scheme for the permission checks the performance is expected to
stay the same as the actual machine instructions should remain the same.
The worst case I could think of is ls -l on a large directory where
all of the stat results need to be translated from kuids and
kgids to uids and gids. So I benchmarked that case on my laptop
with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.
My benchmark consisted of going to single user mode where nothing else
was running. On an ext4 filesystem opening 1,000,000 files and looping
through all of the files 1000 times and calling fstat on the
individual files. This was to ensure I was benchmarking stat times
where the inodes were in the kernel's cache, but the inode values were
not in the processor's cache. My results:
v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
All of the configurations ran in roughly 120ns when I performed tests
that ran in the cpu cache.
So in summary the performance impact is:
1ns improvement in the worst case with user namespace support compiled out.
8ns aka 5% slowdown in the worst case with user namespace support compiled in.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Transform userns->creator from a user_struct reference to a simple
kuid_t, kgid_t pair.
In cap_capable this allows the check to see if we are the creator of
a namespace to become the classic suser style euid permission check.
This allows us to remove the need for a struct cred in the mapping
functions and still be able to display the user namespace creator's
uid and gid as 0.
- Remove the now unnecessary delayed_work in free_user_ns.
All that is left for free_user_ns to do is to call kmem_cache_free
and put_user_ns. Those functions can be called in any context
so call them directly from free_user_ns removing the need for delayed work.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Send an NMI to all CPUs when a hung task is detected and the hung
task code is configured to panic. This gives us a fairly up-to-date
snapshot of all CPUs in the system.
This lets us get stack traces of all CPUs, which makes life easier
trying to debug a deadlock, and the NMI doesn't change anything
since the next step is a kernel panic.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1331848040-1676-1-git-send-email-levinsasha928@gmail.com
[ extended the changelog a bit ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The rcutorture initialization code ignored the error returns from
rcu_torture_onoff_init() and rcu_torture_stall_init(). The rcutorture
cleanup code failed to NULL out a number of pointers. These bugs will
normally have no effect, but this commit fixes them nevertheless.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Both Steven Rostedt's new idle-capable trace macros and the RCU_NONIDLE()
macro can cause RCU to momentarily pause out of idle without the rest
of the system being involved. This can cause rcu_prepare_for_idle()
to run through its state machine too quickly, which can in turn result
in needless scheduling-clock interrupts.
This commit therefore adds code to enable rcu_prepare_for_idle() to
distinguish between an initial entry to idle on the one hand (which needs
to advance the rcu_prepare_for_idle() state machine) and an idle reentry
due to idle-capable trace macros and RCU_NONIDLE() on the other hand
(which should avoid advancing the rcu_prepare_for_idle() state machine).
Additional state is maintained to allow the timer to be correctly reposted
when returning after a momentary pause out of idle, and even more state
is maintained to detect when new non-lazy callbacks have been enqueued
(which may require re-evaluation of the approach to idleness).
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The RCU_FAST_NO_HZ facility uses an hrtimer to wake up a CPU when
it is allowed to go into dyntick-idle mode, which is almost always
cancelled soon after. This is not what hrtimers are good at, so
this commit switches to the timer wheel.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Traces of rcu_prep_idle events can be confusing because
rcu_cleanup_after_idle() does no tracing. This commit therefore adds
this tracing.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_blocking_is_gp() function tests to see if there is only one
online CPU, and if so, synchronize_sched() and friends become no-ops.
However, for larger systems, num_online_cpus() scans a large vector,
and might be preempted while doing so. While preempted, any number
of CPUs might come online and go offline, potentially resulting in
num_online_cpus() returning 1 when there had never been only one
CPU online. This could result in a too-short RCU grace period, which
could in turn result in total failure, except that the only way that
the grace period is too short is if there is an RCU read-side critical
section spanning it. For RCU-sched and RCU-bh (which are the only
cases using rcu_blocking_is_gp()), RCU read-side critical sections
have either preemption or bh disabled, which prevents CPUs from going
offline. This in turn prevents actual failures from occurring.
This commit therefore adds a large block comment to rcu_blocking_is_gp()
documenting why it is safe. This commit also moves rcu_blocking_is_gp()
into kernel/rcutree.c, which should help prevent unwary developers from
mistaking it for a generally useful function.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper
limit of 16 on the leaf-level fanout for the rcu_node tree. This was
needed to reduce lock contention that was induced by the synchronization
of scheduling-clock interrupts, which was in turn needed to improve
energy efficiency for moderate-sized lightly loaded servers.
However, reducing the leaf-level fanout means that there are more
leaf-level rcu_node structures in the tree, which in turn means that
RCU's grace-period initialization incurs more cache misses. This is
not a problem on moderate-sized servers with only a few tens of CPUs,
but becomes a major source of real-time latency spikes on systems with
many hundreds of CPUs. In addition, the workloads running on these large
systems tend to be CPU-bound, which eliminates the energy-efficiency
advantages of synchronizing scheduling-clock interrupts. Therefore,
these systems need maximal values for the rcu_node leaf-level fanout.
This commit addresses this problem by introducing a new kernel parameter
named RCU_FANOUT_LEAF that directly controls the leaf-level fanout.
This parameter defaults to 16 to handle the common case of moderate-sized
lightly loaded servers, but may be set higher on larger systems.
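To make the trade-off concrete, here is a small sketch of how the leaf fanout determines the number of leaf-level nodes. The function name is hypothetical and the CPU counts are only examples:

```c
/* Number of leaf nodes needed to cover nr_cpus CPUs.
 * fanout_leaf must be nonzero. */
unsigned int leaf_node_count(unsigned int nr_cpus, unsigned int fanout_leaf)
{
	return (nr_cpus + fanout_leaf - 1) / fanout_leaf;  /* round up */
}
```

For 4096 CPUs, a leaf fanout of 16 gives 256 leaf nodes to touch during grace-period initialization, while a leaf fanout of 64 gives 64.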
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Hibernation regression fix, since 3.2.
Calculate the number of required free pages based on non-high memory
pages only, because that is where the buffers will come from.
Commit 081a9d043c introduced a new buffer
page allocation logic during hibernation, in order to improve the
performance. The amount of pages allocated was calculated based on total
amount of pages available, although only non-high memory pages are
usable for this purpose. This caused hibernation code to attempt to over
allocate pages on platforms that have high memory, which led to hangs.
Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@suse.de>
Add a debugfs entry under per_cpu/ folder for each cpu called
buffer_size_kb to control the ring buffer size for each CPU
independently.
If the global file buffer_size_kb is used to set size, the individual
ring buffers will be adjusted to the given size. The buffer_size_kb will
report the common size to maintain backward compatibility.
If the buffer_size_kb file under the per_cpu/ directory is used to
change buffer size for a specific CPU, only the size of the respective
ring buffer is updated. When tracing/buffer_size_kb is read, it reports
'X' to indicate that sizes of per_cpu ring buffers are not equivalent.
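The reporting rule can be sketched in userspace as below. The function name and buffer handling are hypothetical; the sketch only shows the "common size or X" decision described above:

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format what a read of the global size file would report: the common
 * size if every per-CPU buffer matches, otherwise "X".
 * Expects nr_cpus >= 1. Returns the snprintf() result.
 */
int format_global_size(const unsigned long *size_kb, size_t nr_cpus,
		       char *out, size_t out_len)
{
	for (size_t i = 1; i < nr_cpus; i++)
		if (size_kb[i] != size_kb[0])
			return snprintf(out, out_len, "X\n");
	return snprintf(out, out_len, "%lu\n", size_kb[0]);
}
```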
Link: http://lkml.kernel.org/r/1328212844-11889-1-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Justin Teravest <teravest@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
memcpy() returns a pointer to "bug". Hopefully, it's not NULL here or
we would already have Oopsed.
Link: http://lkml.kernel.org/r/20120420063145.GA22649@elgon.mountain
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, trace_printk() uses a single buffer to write into
to calculate the size and format needed to save the trace. To
do this safely in an SMP environment, a spin_lock() is taken
to only allow one writer at a time to the buffer. But this could
also affect what is being traced, and add synchronization that
would not be there otherwise.
Ideally, using percpu buffers would be useful, but since trace_printk()
is only used in development, having per cpu buffers for something
never used is a waste of space. Thus, the use of the trace_bprintk()
format section is changed to be used for static fmts as well as dynamic ones.
Then at boot up, we can check if the section that holds the trace_printk
formats is non-empty, and if it does contain something, then we
know a trace_printk() has been added to the kernel. At this time
the trace_printk per cpu buffers are allocated. A check is also
done at module load time in case a module is added that contains a
trace_printk().
Once the buffers are allocated, they are never freed. If you use
a trace_printk() then you should know what you are doing.
A buffer is made for each type of context:
normal
softirq
irq
nmi
The context is checked and the appropriate buffer is used.
This allows for totally lockless usage of trace_printk(),
and they no longer even disable interrupts.
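A userspace sketch of the buffer selection, assuming a fixed CPU count and a caller that already knows its context. All names and sizes are hypothetical:

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 4     /* hypothetical fixed CPU count */
#define BUF_SIZE       1024  /* hypothetical buffer size */

enum ctx { CTX_NORMAL, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI, CTX_COUNT };

/* One buffer per CPU and per context: a context that interrupts
 * another on the same CPU writes to a different buffer. */
static char buffers[NR_CPUS_SKETCH][CTX_COUNT][BUF_SIZE];

/* Return the buffer for this CPU and context, or NULL if out of range. */
char *get_print_buffer(unsigned int cpu, enum ctx c)
{
	if (cpu >= NR_CPUS_SKETCH || c >= CTX_COUNT)
		return NULL;
	return buffers[cpu][c];
}
```

Since no two writers can share a buffer under these assumptions, no lock is needed around the formatting step.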
Requested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a workqueue is flushed with flush_work() lockdep checking can
be circumvented. For example:
static DEFINE_MUTEX(mutex);
static void my_work(struct work_struct *w)
{
	mutex_lock(&mutex);
	mutex_unlock(&mutex);
}
static DECLARE_WORK(work, my_work);
static int __init start_test_module(void)
{
	schedule_work(&work);
	return 0;
}
module_init(start_test_module);
static void __exit stop_test_module(void)
{
	mutex_lock(&mutex);
	flush_work(&work);
	mutex_unlock(&mutex);
}
module_exit(stop_test_module);
would not always print a warning when flush_work() was called.
In this trivial example nothing could go wrong since we are
guaranteed module_init() and module_exit() don't run concurrently,
but if the work item is scheduled asynchronously we could have a
scenario where the work item is running just at the time flush_work()
is called resulting in a classic ABBA locking problem.
Add a lockdep hint by acquiring and releasing the work item
lockdep_map in flush_work() so that we always catch this
potential deadlock scenario.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Allowing kthreadd to be moved to a non-root group makes no sense, it being
a global resource, and needlessly leads unsuspecting users toward trouble.
1. An RT workqueue worker thread spawned in a task group with no rt_runtime
allocated is not schedulable. Simple user error, but harmful to the box.
2. A worker thread which acquires PF_THREAD_BOUND can never leave a cpuset,
rendering the cpuset immortal.
Save the user some unexpected trouble, just say no.
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The file kernel/irq/debug.h temporarily defines P, PS, PD
and then undefines them. However these names aren't really
"internal" enough, and collide with other more legit users
such as the ones in the xtensa arch, causing:
In file included from kernel/irq/internals.h:58:0,
from kernel/irq/irqdesc.c:18:
kernel/irq/debug.h:8:0: warning: "PS" redefined [enabled by default]
arch/xtensa/include/asm/regs.h:59:0: note: this is the location of the previous definition
Add a handful of underscores to do a better job of hiding these
temporary macros.
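The pattern of a short-lived helper macro with a deliberately obscure name can be sketched in userspace as follows. The flag names and the function are hypothetical; the triple-underscore name mirrors the approach described here, although in portable userspace code such names are reserved for the implementation:

```c
#include <stdio.h>

#define FLAG_PENDING  0x1u   /* hypothetical flag names */
#define FLAG_DISABLED 0x2u

/* Write the names of the set flags to out; return how many were set. */
int dump_flags(unsigned int flags, FILE *out)
{
	int n = 0;

/* The underscores keep the helper from clashing with an arch's own P/PS/PD. */
#define ___P(f) do { if (flags & (f)) { fprintf(out, "%14s set\n", #f); n++; } } while (0)
	___P(FLAG_PENDING);
	___P(FLAG_DISABLED);
#undef ___P

	return n;
}
```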
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The Android alarm interface provides a settime call that sets both
the alarmtimer RTC device and CLOCK_REALTIME to the same value.
Since there may be multiple rtc devices, provide a hook to access the
one the alarmtimer infrastructure is using.
CC: Colin Cross <ccross@android.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Android Kernel Team <kernel-team@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While debugging a latency with someone on IRC (mirage335) on #linux-rt (OFTC),
we discovered that the stacktrace output of the latency tracers
(preemptirqsoff) was empty.
This bug was caused by the creation of the dynamic length stack trace
again (like commit 12b5da3 "tracing: Fix ent_size in trace output" was).
This bug is caused by the latency tracers requiring the next event
to determine the time between the current event and the next. But by
grabbing the next event, the iter->ent_size is set to the next event
instead of the current one. As the stacktrace event is the last event,
this makes the ent_size zero and causes nothing to be printed for
the stack trace. The dynamic stacktrace uses the ent_size to determine
how much of the stack can be printed. The ent_size of zero means
no stack.
The simple fix is to save the iter->ent_size before finding the next event.
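The save-and-restore idea can be sketched with a pared-down, hypothetical iterator. None of these names come from the tracing code; the sketch only shows why peeking at the next entry must not clobber the current entry's size:

```c
#include <stddef.h>

/* Hypothetical, pared-down iterator over trace entries. */
struct iter {
	const int *sizes;   /* size of each entry */
	size_t nr;          /* number of entries */
	size_t pos;         /* index of the current entry */
	int ent_size;       /* size of the entry last looked up */
};

/* Looking up the next entry overwrites ent_size (0 if there is none). */
int peek_next(struct iter *it)
{
	size_t next = it->pos + 1;

	it->ent_size = next < it->nr ? it->sizes[next] : 0;
	return next < it->nr;
}

/* Peek at the next entry for timing, then restore the current size. */
int current_size_after_peek(struct iter *it)
{
	int saved = it->ent_size;   /* save before finding the next event */

	(void)peek_next(it);
	it->ent_size = saved;       /* restore for printing the current entry */
	return it->ent_size;
}
```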
Note, mirage335 asked to remain anonymous from LKML and git, so I will
not add the Reported-by and Tested-by tags, even though he did report
the issue and tested the fix.
Cc: stable@vger.kernel.org # 3.1+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge reason: development work has dependency on kvm patches merged
upstream.
Conflicts:
Documentation/feature-removal-schedule.txt
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
During resume, tick_resume_broadcast() programs the broadcast timer in
oneshot mode unconditionally. On the platforms where broadcast timer
is not really required, this will generate spurious broadcast timer
ticks upon resume. For example, on the always running apic timer
platforms with HPET, I see a spurious hpet tick once every ~5 minutes
(which is the 32-bit hpet counter wraparound time).
Similar to boot time, during resume make the oneshot mode setting of
the broadcast clock event device conditional on the state of active
broadcast users.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Tested-by: svenjoac@gmx.de
Cc: torvalds@linux-foundation.org
Cc: rjw@sisk.pl
Link: http://lkml.kernel.org/r/1334802459.28674.209.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Santosh found another trap when we avoid initializing the broadcast
device in the switch_to_oneshot code. The broadcast device might be
still in SHUTDOWN state when we actually need to use it. That
obviously breaks, as set_next_event() is called on a shutdown
device. This did not break on x86, but Suresh analyzed it:
From the review, most likely on Sven's system we are force enabling
the hpet using the pci quirk's method very late. And in this case,
hpet_clockevent (which will be global_clock_event) handler can be
null, specifically as this platform might not be using deeper c-states
and using the reliable APIC timer.
Prior to commit 'fa4da365bc7772c', that handler will be set to
'tick_handle_oneshot_broadcast' when we switch the broadcast timer to
oneshot mode, even though we don't use it. Post commit
'fa4da365bc7772c', we stopped switching the broadcast mode to oneshot
as this is not really needed and his platform's global_clock_event's
handler will remain null. While on my SNB laptop, same is set to
'clockevents_handle_noop' because hpet gets enabled very early. (noop
handler on my platform set when the early enabled hpet timer gets
replaced by the lapic timer).
But the commit 'fa4da365bc7772c' tracked the broadcast timer mode in
the SW as oneshot, even though it didn't touch the HW timer. During
resume however, tick_resume_broadcast() saw the SW broadcast mode as
oneshot and actually programmed the broadcast device also into oneshot
mode. So this triggered the null pointer de-reference after the hpet
wraps around and depending on what the hpet counter is set to. On the
normal platforms where hpet gets enabled early we should be seeing a
spurious interrupt (in my SNB laptop I see one spurious interrupt
after around 5 minutes ;) which is 32-bit hpet counter wraparound
time), but that's a separate issue.
Enforce the mode setting when trying to set an event.
Reported-and-tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: torvalds@linux-foundation.org
Cc: svenjoac@gmx.de
Cc: rjw@sisk.pl
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1204181723350.2542@ionos
* tag 'v3.4-rc3': (3755 commits)
Linux 3.4-rc3
x86-32: fix up strncpy_from_user() sign error
ARM: 7386/1: jump_label: fixup for rename to static_key
ARM: 7384/1: ThumbEE: Disable userspace TEEHBR access for !CONFIG_ARM_THUMBEE
ARM: 7382/1: mm: truncate memory banks to fit in 4GB space for classic MMU
ARM: 7359/2: smp_twd: Only wait for reprogramming on active cpus
PCI: Fix regression in pci_restore_state(), v3
SCSI: Fix error handling when no ULD is attached
ARM: OMAP: clock: cleanup CPUfreq leftovers, fix build errors
ARM: dts: remove blank interrupt-parent properties
ARM: EXYNOS: Fix Kconfig dependencies for device tree enabled machine files
do not export kernel's NULL #define to userspace
ARM: EXYNOS: Remove broken config values for touchscren for NURI board
ARM: EXYNOS: set fix xusbxti clock for NURI and Universal210 boards
ARM: EXYNOS: fix regulator name for NURI board
ARM: SAMSUNG: make SAMSUNG_PM_DEBUG select DEBUG_LL
cpufreq: OMAP: fix build errors: depends on ARCH_OMAP2PLUS
sparc64: Eliminate obsolete __handle_softirq() function
sparc64: Fix bootup crash on sun4v.
ARM: msm: Fix section mismatches in proc_comm.c
...
We require that shared interrupts agree on a few flag settings. Right
now we silently return with an error code without giving any hint why
we reject it.
Make the printout unconditional and actually useful by printing the
flags of the new and the already registered action.
Convert all printks to pr_* and use a proper prefix while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Requesting a threaded interrupt without a primary handler and without
IRQF_ONESHOT set is dangerous.
The core will use the default primary handler for it, which merely
wakes the thread. For a level type interrupt this results in an
interrupt storm, because the interrupt line is reenabled after the
primary handler runs. The device still has the line asserted, which
brings us back into the primary handler.
While this works for edge type interrupts, we play it safe and reject
unconditionally because we can't say for sure which type this
interrupt really has. The type flags are unreliable as the underlying
chip implementation can override them. And we cannot assume that
developers using that interface know what they are doing.
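
The rejection rule described above can be modelled in a few lines of
userspace C. This is only a sketch: the type, the flag value and the
function names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for IRQF_ONESHOT; the value is illustrative only. */
#define MODEL_IRQF_ONESHOT 0x2000UL

/* Hypothetical handler type, loosely shaped like an irq handler. */
typedef int (*model_handler_t)(int irq, void *dev_id);

/* Placeholder thread function so callers have something to pass in. */
int model_dummy_thread(int irq, void *dev_id)
{
    (void)irq;
    (void)dev_id;
    return 0;
}

/*
 * Model of the sanity check: a request without a primary handler is
 * accepted only if the caller also asked for ONESHOT semantics.
 */
int model_check_threaded_request(model_handler_t handler,
                                 model_handler_t thread_fn,
                                 unsigned long flags)
{
    if (!handler && !thread_fn)
        return -EINVAL;         /* nothing to run at all */
    if (!handler && !(flags & MODEL_IRQF_ONESHOT))
        return -EINVAL;         /* default primary handler could storm */
    return 0;
}
```

The model rejects regardless of trigger type, mirroring the "play it
safe" reasoning above.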
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Sven Joachim reported that suspend/resume on rc3 trips over a NULL
pointer dereference. Linus spotted the clockevent handler being NULL.
commit fa4da365b(clockevents: tTack broadcast device mode change in
tick_broadcast_switch_to_oneshot()) tried to fix a problem with the
broadcast device setup, which was introduced in commit 77b0d60c5(
clockevents: Leave the broadcast device in shutdown mode when not
needed).
The initial commit avoided setting up the broadcast device when no
broadcast request bits were set, but that left the broadcast device
dysfunctional. As a consequence, deep idle states which need the
broadcast device were not woken up.
commit fa4da365b tried to fix that by initializing the state of the
broadcast facility, but that missed the fact that nothing initializes
the event handler and some other state of the underlying clock event
device.
The fix is to revert both commits and make only the mode setting of
the clock event device conditional on the state of active broadcast
users.
That initializes everything except the low level device mode, but this
happens when the broadcast functionality is invoked by deep idle.
Reported-and-tested-by: Sven Joachim <svenjoac@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1204181205540.2542@ionos
If both audit and seccomp filter support are disabled, 'ret' is marked
as unused.
If just seccomp filter support is disabled, data and skip are considered
unused.
This change fixes those build warnings.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
As of:
29494be71a ("rcu,cleanup: simplify the code when cpu is dying")
RCU adopts callbacks from the dying CPU in its CPU_DYING notifier,
which means that any callbacks posted by later CPU_DYING notifiers
are ignored until the CPU comes back online.
A WARN_ON_ONCE() was added to __call_rcu() by:
e560140008 ("rcu: Simplify offline processing")
to check for this condition. Although this condition did not trigger
(at least as far as I know) during -next testing, it did recently
trigger in mainline:
https://lkml.org/lkml/2012/4/2/34
What is needed longer term is for RCU's CPU_DEAD notifier to adopt any
callbacks that were posted by CPU_DYING notifiers. However, the Linux
kernel has been running with this sort of thing happening for quite
some time. So the only thing that qualifies as a regression is the
WARN_ON_ONCE(), which this commit removes.
Making RCU's CPU_DEAD notifier adopt callbacks posted by CPU_DYING
notifiers is a topic for the 3.5 release of the Linux kernel.
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This BUG_ON() can be triggered if you call schedule_work() before
calling INIT_WORK(). It is definitely a bug, but it's nicer to just
print a stack trace and return.
Reported-by: Matt Renzelmann <mjr@cs.wisc.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The change to make tracing_on affect only the ftrace ring buffer caused
a bug where it won't affect any ring buffer. The problem was that the buffer
of the trace_array was passed to the write function and not the trace array
itself.
The trace_array can change the buffer when running a latency tracer. If this
happens, then the buffer being disabled may not be the buffer currently used
by ftrace. This will cause the tracing_on file to become useless.
The simple fix is to pass the trace_array to the write function instead of
the buffer. Then the actual buffer may be changed.
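
The difference between holding on to the buffer and holding on to the
trace_array can be illustrated with a small userspace model. The struct
and function names below are hypothetical simplifications, not the
tracing code itself.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for the tracing structures. */
struct model_ring_buffer {
    bool enabled;
};

struct model_trace_array {
    struct model_ring_buffer *buffer;   /* may be swapped by a latency tracer */
};

/* Buggy shape: operates on a buffer pointer captured earlier. */
void model_write_stale(struct model_ring_buffer *captured, bool on)
{
    captured->enabled = on;
}

/* Fixed shape: resolves the buffer through the trace_array at write time. */
void model_write_current(struct model_trace_array *tr, bool on)
{
    tr->buffer->enabled = on;
}
```

With the stale variant, a swap of tr->buffer after the pointer was
captured leaves the buffer that is actually in use untouched.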
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Uprobes has a callback (uprobe_munmap()) in the unmap path to
maintain the uprobes count.
In the exit path this callback gets called in unlink_file_vma().
However by the time unlink_file_vma() is called, the pages would
have been unmapped (in unmap_vmas()) and the task->rss_stat counts
accounted (in zap_pte_range()).
If the exiting process has probepoints, uprobe_munmap() checks if
the breakpoint instruction was around before decrementing the probe
count.
This results in a file backed page being reread by uprobe_munmap()
and hence it does not find the breakpoint.
This patch fixes this problem by moving the callback to
unmap_single_vma(). Since unmap_single_vma() may not unmap the
complete vma, add start and end parameters to uprobe_munmap().
This bug became apparent courtesy of commit c3f0327f8e
("mm: add rss counters consistency check").
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120411103527.23245.9835.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge in latest upstream (and the latest perf development tree),
to prepare for tooling changes, and also to pick up v3.4 MM
changes that the uprobes code needs to take care of.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull system.h fixups for less common arch's from Paul Gortmaker:
"Here is what is hopefully the last of the system.h related fixups.
The fixes for Alpha and ia64 are code relocations consistent with what
was done for the more mainstream architectures. Note that the
diffstat lines removed vs lines added are not the same since I've
fixed some of the whitespace issues in the relocated code blocks.
However they are functionally the same. Compile tested locally, plus
these two have been in linux-next for a while.
There is also a trivial one line system.h related fix for the Tilera
arch from Chris Metcalf to fix an implicit include."
* 'systemh-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
irq_work: fix compile failure on tile from missing include
ia64: populate the cmpxchg header with appropriate code
alpha: fix build failures from system.h dismemberment
Today's -next fails to link for me:
kernel/built-in.o:(.data+0x178e50): undefined reference to `perf_ftrace_event_register'
It looks like multiple fixes have been merged for the issue fixed by
commit fa73dc9 (tracing: Fix build breakage without CONFIG_PERF_EVENTS)
though I can't identify the other changes that have gone in at the
minute, it's possible that the changes which caused the breakage fixed
by the previous commit got dropped but the fix made it in.
Link: http://lkml.kernel.org/r/1334307179-21255-1-git-send-email-broonie@opensource.wolfsonmicro.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This change adds support for a new ptrace option, PTRACE_O_TRACESECCOMP,
and a new return value for seccomp BPF programs, SECCOMP_RET_TRACE.
When a tracer specifies the PTRACE_O_TRACESECCOMP ptrace option, the
tracer will be notified, via PTRACE_EVENT_SECCOMP, for any syscall that
results in a BPF program returning SECCOMP_RET_TRACE. The 16-bit
SECCOMP_RET_DATA mask of the BPF program return value will be passed as
the ptrace_message and may be retrieved using PTRACE_GETEVENTMSG.
If the subordinate process is not using seccomp filter, then no
system call notifications will occur even if the option is specified.
If there is no tracer with PTRACE_O_TRACESECCOMP when SECCOMP_RET_TRACE
is returned, the system call will not be executed and an -ENOSYS errno
will be returned to userspace.
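
On the tracer side, the notification described above could be consumed
roughly as follows. This is a sketch that assumes a glibc whose
<sys/ptrace.h> defines PTRACE_EVENT_SECCOMP; the helper names are made
up for illustration.

```c
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* True when a waitpid() status describes a PTRACE_EVENT_SECCOMP stop. */
bool is_seccomp_event_stop(int status)
{
    return WIFSTOPPED(status) &&
           (status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8));
}

/*
 * Fetch the 16-bit SECCOMP_RET_DATA value for a tracee that is currently
 * in a seccomp event stop. Returns 0 on success, -1 if ptrace() failed.
 */
int read_seccomp_event_data(pid_t tracee, unsigned int *data)
{
    unsigned long msg = 0;

    if (ptrace(PTRACE_GETEVENTMSG, tracee, NULL, &msg) == -1)
        return -1;
    *data = (unsigned int)(msg & 0xffff);
    return 0;
}
```

A tracer would call is_seccomp_event_stop() on each status returned by
waitpid() and only then ask for the event message.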
This change adds a dependency on the system call slow path. Any future
efforts to use the system call fast path for seccomp filter will need to
address this restriction.
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
v18: - rebase
- comment fatal_signal check
- acked-by
- drop secure_computing_int comment
v17: - ...
v16: - update PT_TRACE_MASK to 0xbf4 so that STOP isn't clear on SETOPTIONS call (indan@nul.nu)
[note PT_TRACE_MASK disappears in linux-next]
v15: - add audit support for non-zero return codes
- clean up style (indan@nul.nu)
v14: - rebase/nochanges
v13: - rebase on to 88ebdda615
(Brings back a change to ptrace.c and the masks.)
v12: - rebase to linux-next
- use ptrace_event and update arch/Kconfig to mention slow-path dependency
- drop all tracehook changes and inclusion (oleg@redhat.com)
v11: - invert the logic to just make it a PTRACE_SYSCALL accelerator
(indan@nul.nu)
v10: - moved to PTRACE_O_SECCOMP / PT_TRACE_SECCOMP
v9: - n/a
v8: - guarded PTRACE_SECCOMP use with an ifdef
v7: - introduced
Signed-off-by: James Morris <james.l.morris@oracle.com>
Adds a new return value to seccomp filters that triggers a SIGSYS to be
delivered with the new SYS_SECCOMP si_code.
This allows in-process system call emulation, including just specifying
an errno or cleanly dumping core, rather than just dying.
Suggested-by: Markus Gutschke <markus@chromium.org>
Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
v18: - acked-by, rebase
- don't mention secure_computing_int() anymore
v15: - use audit_seccomp/skip
- pad out error spacing; clean up switch (indan@nul.nu)
v14: - n/a
v13: - rebase on to 88ebdda615
v12: - rebase on to linux-next
v11: - clarify the comment (indan@nul.nu)
- s/sigtrap/sigsys
v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig
note suggested-by (though original suggestion had other behaviors)
v9: - changes to SIGILL
v8: - clean up based on changes to dependent patches
v7: - introduction
Signed-off-by: James Morris <james.l.morris@oracle.com>
This change enables SIGSYS, defines _sigfields._sigsys, and adds
x86 (compat) arch support. _sigsys defines fields which allow
a signal handler to receive the triggering system call number,
the relevant AUDIT_ARCH_* value for that number, and the address
of the callsite.
SIGSYS is added to the SYNCHRONOUS_MASK because it is desirable for it
to have setup_frame() called for it. The goal is to ensure that
ucontext_t reflects the machine state from the time-of-syscall and not
from another signal handler.
The first consumer of SIGSYS would be seccomp filter. In particular,
a filter program could specify a new return value, SECCOMP_RET_TRAP,
which would result in the system call being denied and the calling
thread signaled. This also means that implementing arch-specific
support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Eric Paris <eparis@redhat.com>
v18: - added acked by, rebase
v17: - rebase and reviewed-by addition
v14: - rebase/nochanges
v13: - rebase on to 88ebdda615
v12: - reworded changelog (oleg@redhat.com)
v11: - fix dropped words in the change description
- added fallback copy_siginfo support.
- added __ARCH_SIGSYS define to allow stepped arch support.
v10: - first version based on suggestion
Signed-off-by: James Morris <james.l.morris@oracle.com>
This change adds the SECCOMP_RET_ERRNO as a valid return value from a
seccomp filter. Additionally, it makes the first use of the lower
16-bits for storing a filter-supplied errno. 16-bits is more than
enough for the errno-base.h calls.
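
The split between the action and the 16-bit data part can be sketched
with the constants from <linux/seccomp.h>. The helper names are
hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <linux/seccomp.h>

/* Compose a filter return value that fails the syscall with `err`. */
unsigned int seccomp_ret_errno(unsigned int err)
{
    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA);
}

/* Extract the action part of a filter return value. */
unsigned int seccomp_ret_action(unsigned int ret)
{
    return ret & SECCOMP_RET_ACTION;
}

/* Extract the lower 16 bits, which carry the filter-supplied errno. */
unsigned int seccomp_ret_data(unsigned int ret)
{
    return ret & SECCOMP_RET_DATA;
}
```

A filter would return seccomp_ret_errno(EPERM) to make a call fail with
EPERM instead of terminating the task.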
Returning errors instead of immediately terminating processes that
violate seccomp policy allows for broader use of this functionality
for kernel attack surface reduction. For example, a Linux container
could maintain a whitelist of pre-existing system calls but drop
all new ones with errnos. This would keep a logically static attack
surface while providing errnos that may allow for graceful failure
without the downside of do_exit() on a bad call.
This change also changes the signature of __secure_computing. It
appears the only direct caller is the arm entry code and it clobbers
any possible return value (register) immediately.
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
v18: - fix up comments and rebase
- fix bad var name which was fixed in later revs
- remove _int() and just change the __secure_computing signature
v16-v17: ...
v15: - use audit_seccomp and add a skip label. (eparis@redhat.com)
- clean up and pad out return codes (indan@nul.nu)
v14: - no change/rebase
v13: - rebase on to 88ebdda615
v12: - move to WARN_ON if filter is NULL
(oleg@redhat.com, luto@mit.edu, keescook@chromium.org)
- return immediately for filter==NULL (keescook@chromium.org)
- change evaluation to only compare the ACTION so that layered
errnos don't result in the lowest one being returned.
(keescook@chromium.org)
v11: - check for NULL filter (keescook@chromium.org)
v10: - change loaders to fn
v9: - n/a
v8: - update Kconfig to note new need for syscall_set_return_value.
- reordered such that TRAP behavior follows on later.
- made the for loop a little less indent-y
v7: - introduced
Signed-off-by: James Morris <james.l.morris@oracle.com>
This consolidates the seccomp filter error logging path and adds more
details to the audit log.
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
v18: make compat= permanent in the record
v15: added a return code to the audit_seccomp path by wad@chromium.org
(suggested by eparis@redhat.com)
v*: original by keescook@chromium.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
[This patch depends on luto@mit.edu's no_new_privs patch:
https://lkml.org/lkml/2012/1/30/264
The whole series including Andrew's patches can be found here:
https://github.com/redpig/linux/tree/seccomp
Complete diff here:
https://github.com/redpig/linux/compare/1dc65fed...seccomp
]
This patch adds support for seccomp mode 2. Mode 2 introduces the
ability for unprivileged processes to install system call filtering
policy expressed in terms of a Berkeley Packet Filter (BPF) program.
This program will be evaluated in the kernel for each system call
the task makes and computes a result based on data in the format
of struct seccomp_data.
A filter program may be installed by calling:
struct sock_fprog fprog = { ... };
...
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
The return value of the filter program determines if the system call is
allowed to proceed or denied. If the first filter program installed
allows prctl(2) calls, then the above call may be made repeatedly
by a task to further reduce its access to the kernel. All attached
programs must be evaluated before a system call will be allowed to
proceed.
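
As a rough illustration, a complete (if tiny) filter and its
installation could look like the sketch below. The policy (fail
getpid() with EPERM, allow everything else) and all names are made up
for the example. A real filter should also check seccomp_data.arch
before trusting the syscall number, which is omitted here for brevity.
Setting no_new_privs explicitly first is an assumption that follows the
"require no_new_privs || CAP_SYS_ADMIN" note in the v6 changelog.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Illustrative policy: getpid() fails with EPERM, all else is allowed. */
struct sock_filter example_insns[] = {
    /* A = seccomp_data.nr (the system call number) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* if (A == __NR_getpid) continue with the next insn, else skip it */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Number of instructions in the example program. */
unsigned short example_filter_len(void)
{
    return (unsigned short)(sizeof(example_insns) / sizeof(example_insns[0]));
}

/* Attach the filter to the calling task. Returns 0 on success, -1 on error. */
int install_example_filter(void)
{
    struct sock_fprog fprog = {
        .len = example_filter_len(),
        .filter = example_insns,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```

Because the example policy still allows prctl(2), a task could call
install_example_filter() again later with a stricter program.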
Filter programs will be inherited across fork/clone and execve.
However, if the task attaching the filter is unprivileged
(!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
ensures that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., setuid binary).
There are a number of benefits to this approach. A few of them are
as follows:
- BPF has been exposed to userland for a long time
- BPF optimization (and JIT'ing) are well understood
- Userland already knows its ABI: system call numbers and desired
arguments
- No time-of-check-time-of-use vulnerable data accesses are possible.
- system call arguments are loaded on access only to minimize copying
required for system call policy decisions.
Mode 2 support is restricted to architectures that enable
HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
syscall_get_arguments(). The full desired scope of this feature will
add a few minor additional requirements expressed later in this series.
Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
the desired additional functionality.
No architectures are enabled in this patch.
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Indan Zupancic <indan@nul.nu>
Acked-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
v18: - rebase to v3.4-rc2
- s/chk/check/ (akpm@linux-foundation.org,jmorris@namei.org)
- allocate with GFP_KERNEL|__GFP_NOWARN (indan@nul.nu)
- add a comment for get_u32 regarding endianness (akpm@)
- fix other typos, style mistakes (akpm@)
- added acked-by
v17: - properly guard seccomp filter needed headers (leann@ubuntu.com)
- tighten return mask to 0x7fff0000
v16: - no change
v15: - add a 4 instr penalty when counting a path to account for seccomp_filter
size (indan@nul.nu)
- drop the max insns to 256KB (indan@nul.nu)
- return ENOMEM if the max insns limit has been hit (indan@nul.nu)
- move IP checks after args (indan@nul.nu)
- drop !user_filter check (indan@nul.nu)
- only allow explicit bpf codes (indan@nul.nu)
- exit_code -> exit_sig
v14: - put/get_seccomp_filter takes struct task_struct
(indan@nul.nu,keescook@chromium.org)
- adds seccomp_chk_filter and drops general bpf_run/chk_filter user
- add seccomp_bpf_load for use by net/core/filter.c
- lower max per-process/per-hierarchy: 1MB
- moved nnp/capability check prior to allocation
(all of the above: indan@nul.nu)
v13: - rebase on to 88ebdda615
v12: - added a maximum instruction count per path (indan@nul.nu,oleg@redhat.com)
- removed copy_seccomp (keescook@chromium.org,indan@nul.nu)
- reworded the prctl_set_seccomp comment (indan@nul.nu)
v11: - reorder struct seccomp_data to allow future args expansion (hpa@zytor.com)
- style clean up, @compat dropped, compat_sock_fprog32 (indan@nul.nu)
- do_exit(SIGSYS) (keescook@chromium.org, luto@mit.edu)
- pare down Kconfig doc reference.
- extra comment clean up
v10: - seccomp_data has changed again to be more aesthetically pleasing
(hpa@zytor.com)
- calling convention is noted in a new u32 field using syscall_get_arch.
This allows for cross-calling convention tasks to use seccomp filters.
(hpa@zytor.com)
- lots of clean up (thanks, Indan!)
v9: - n/a
v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
- Lots of fixes courtesy of indan@nul.nu:
-- fix up load behavior, compat fixups, and merge alloc code,
-- renamed pc and dropped __packed, use bool compat.
-- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
dependencies
v7: (massive overhaul thanks to Indan, others)
- added CONFIG_HAVE_ARCH_SECCOMP_FILTER
- merged into seccomp.c
- minimal seccomp_filter.h
- no config option (part of seccomp)
- no new prctl
- doesn't break seccomp on systems without asm/syscall.h
(works but arg access always fails)
- dropped seccomp_init_task, extra free functions, ...
- dropped the no-asm/syscall.h code paths
- merges with network sk_run_filter and sk_chk_filter
v6: - fix memory leak on attach compat check failure
- require no_new_privs || CAP_SYS_ADMIN prior to filter
installation. (luto@mit.edu)
- s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com)
- cleaned up Kconfig (amwang@redhat.com)
- on block, note if the call was compat (so the # means something)
v5: - uses syscall_get_arguments
(indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
- uses union-based arg storage with hi/lo struct to
handle endianness. Compromises between the two alternate
proposals to minimize extra arg shuffling and account for
endianness assuming userspace uses offsetof().
(mcgrathr@chromium.org, indan@nul.nu)
- update Kconfig description
- add include/seccomp_filter.h and add its installation
- (naive) on-demand syscall argument loading
- drop seccomp_t (eparis@redhat.com)
v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
- now uses current->no_new_privs
(luto@mit.edu,torvalds@linux-foundation.com)
- assign names to seccomp modes (rdunlap@xenotime.net)
- fix style issues (rdunlap@xenotime.net)
- reworded Kconfig entry (rdunlap@xenotime.net)
v3: - macros to inline (oleg@redhat.com)
- init_task behavior fixed (oleg@redhat.com)
- drop creator entry and extra NULL check (oleg@redhat.com)
- alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
- adds tentative use of "always_unprivileged" as per
torvalds@linux-foundation.org and luto@mit.edu
v2: - (patch 2 only)
Signed-off-by: James Morris <james.l.morris@oracle.com>
With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
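
Put together, setting and querying the bit could be sketched as follows.
The fallback defines cover older headers, and the helper name is
hypothetical.

```c
#include <sys/prctl.h>

#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif
#ifndef PR_GET_NO_NEW_PRIVS
#define PR_GET_NO_NEW_PRIVS 39
#endif

/*
 * Set the bit for the calling task, then read it back.
 * Returns 1 if the bit is set, 0 if it is not, -1 on error.
 */
int set_and_query_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

Note that there is no way to clear the bit again, so the helper is best
called only by tasks that really want to give up privilege gains across
execve.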
This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
Building with IRQ_WORK configured results in
kernel/irq_work.c: In function ‘irq_work_run’:
kernel/irq_work.c:110: error: implicit declaration of function ‘irqs_disabled’
The appropriate header just needs to be included.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Format string bug fix for irqdomain debug output on 64 bit platforms
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull a fix for the recent irqdomain bug fixes from Grant Likely:
"I flubbed one patch in the last pull request which broke a format
string on 64 bit platforms. Here's the fix."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
irq_domain: fix type mismatch in debugfs output format
sizeof(void*) returns an unsigned long, but it was being used as a width
parameter to a "%-*s" format string which requires an int. On 64 bit
platforms this causes a type mismatch:
linux/kernel/irq/irqdomain.c:575: warning: field width should have type
'int', but argument 6 has type 'long unsigned int'
This change casts the size to an int so printf gets the right data type.
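
The same pattern in plain C, as a minimal sketch with snprintf() in
place of the kernel's seq_file output (the function name and the
trailing '|' marker are made up for the example):

```c
#include <stdio.h>

/*
 * Format a left-justified column wide enough for "0x" plus a full
 * pointer. sizeof yields a size_t, so the width is cast to int, which
 * is what the '*' in "%-*s" expects.
 */
int format_host_data_column(char *buf, size_t len, const char *text)
{
    return snprintf(buf, len, "%-*s|", (int)(2 * sizeof(void *) + 2), text);
}
```

Without the cast the code still compiles with a warning, but passing a
size_t where printf expects an int is undefined behaviour on platforms
where the two types differ in size.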
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: David Daney <david.daney@cavium.com>
Pull timer fixes from Thomas Gleixner:
"The itimer removal one is not strictly a fix, but I really wanted to
avoid a rebase of the urgent ones."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "clocksource: Load the ACPI PM clocksource asynchronously"
clockevents: tTack broadcast device mode change in tick_broadcast_switch_to_oneshot()
itimer: Use printk_once instead of WARN_ONCE
nohz: Fix stale jiffies update in tick_nohz_restart()
tick: Document TICK_ONESHOT config option
proc: stats: Use arch_idle_time for idle and iowait times if available
itimer: Schedule silent NULL pointer fixup in setitimer() for removal
Merge fixes from Andrew Morton.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (14 patches)
panic: fix stack dump print on direct call to panic()
drivers/rtc/rtc-pl031.c: enable clock on all ST variants
Revert "mm: vmscan: fix misused nr_reclaimed in shrink_mem_cgroup_zone()"
hugetlb: fix race condition in hugetlb_fault()
drivers/rtc/rtc-twl.c: use static register while reading time
drivers/rtc/rtc-s3c.c: add placeholder for driver private data
drivers/rtc/rtc-s3c.c: fix compilation error
MAINTAINERS: add PCDP console maintainer
memcg: do not open code accesses to res_counter members
drivers/rtc/rtc-efi.c: fix section mismatch warning
drivers/rtc/rtc-r9701.c: reset registers if invalid values are detected
drivers/char/random.c: fix boot id uniqueness race
memcg: fix broken boolen expression
memcg: fix up documentation on global LRU
Migration status depends on a difference of weight from 0 and 1.
If weight > 1 (<= 1) and old weight <= 1 (> 1) then task becomes
pushable (or not pushable). We are not interested in its exact
value, whether it is 3 or 4, for example.
Now if we are changing affinity from a set of 3 cpus to a set of 4, the
task will be dequeued and enqueued sequentially without important
difference in comparison with initial state. The only difference is in
internal representation of plist queue of pushable tasks and the fact
that the task may not be the first in a sequence of the same priority
tasks. But it seems to me it gains nothing.
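
The argument boils down to a boundary test on the affinity weight, which
can be written as a small model (function names are hypothetical):

```c
#include <stdbool.h>

/* A task is a push candidate only if it may run on more than one CPU. */
bool model_is_pushable(unsigned int weight)
{
    return weight > 1;
}

/* Requeueing only matters when the weight crosses the "> 1" boundary. */
bool model_needs_requeue(unsigned int old_weight, unsigned int new_weight)
{
    return model_is_pushable(old_weight) != model_is_pushable(new_weight);
}
```

Going from 3 to 4 cpus leaves the task pushable, so the model reports
that no requeue is needed, while going from 1 to 4 (or 4 to 1) does
require one.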
Link: http://lkml.kernel.org/r/273741334120764@web83.yandex.ru
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tkhai Kirill <tkhai@yandex.ru>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 6e6f0a1f0f ("panic: don't print redundant backtraces on oops")
causes a regression where no stack trace will be printed at all for the
case where kernel code calls panic() directly while not processing an
oops, and of course there are 100's of instances of this type of call.
The original commit executed the check (!oops_in_progress), but this will
always be false because just before the dump_stack() there is a call to
bust_spinlocks(1), which does the following:
void __attribute__((weak)) bust_spinlocks(int yes)
{
if (yes) {
++oops_in_progress;
The proper way to resolve the problem that original commit tried to
solve is to avoid printing a stack dump from panic() when either of
the following conditions is true:
1) TAINT_DIE has been set (this is done by oops_end())
This indicates an oops has already been printed.
2) oops_in_progress > 1
This guards against the rare case where panic() is invoked
a second time, or in between oops_begin() and oops_end()
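
Expressed as a predicate, the two conditions above amount to the
following userspace model. The function and parameter names are
hypothetical; oops_in_progress is the counter as seen after
bust_spinlocks(1) has already incremented it.

```c
#include <stdbool.h>

/*
 * Decide whether panic() should print a stack dump.
 * taint_die stands for "TAINT_DIE has been set".
 */
bool model_panic_should_dump_stack(bool taint_die, int oops_in_progress)
{
    if (taint_die)
        return false;   /* an oops has already printed a trace */
    if (oops_in_progress > 1)
        return false;   /* nested panic, or panic inside an oops */
    return true;
}
```

For a direct call to panic() the counter is exactly 1 after
bust_spinlocks(1), so the model returns true and the trace is printed.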
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org> [3.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This branch fixes a bug in irq_create_mapping() where an error return
from irq_alloc_desc_from() gets ignored. It also removes irq_virq_count
to fix a bug on powerpc where the irqdomain code does not find irqs
allocated above the CONFIG_NR_IRQS boundary. The remaining patches get
rid of a completely pointless export and fix some minor bugs in the
irqdomain debug output.
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull irqdomain bug fixes from Grant Likely:
"This branch fixes a bug in irq_create_mapping() where an error return
from irq_alloc_desc_from() gets ignored.
It also removes irq_virq_count to fix a bug on powerpc where the
irqdomain code does not find irqs allocated above the CONFIG_NR_IRQS
boundary.
The remaining patches get rid of a completely pointless export and
fix some minor bugs in the irqdomain debug output."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
irq_domain: Move irq_virq_count into NOMAP revmap
irqdomain: Fix debugfs formatting
irq_domain: correct the debugfs file name
irq: Kill pointless irqd_to_hw export
irq/irq_domain: Quit ignoring error returns from irq_alloc_desc_from().
This patch replaces the old global setting of irq_virq_count that is only
used by the NOMAP mapping and instead uses a revmap_data property so that
the maximum NOMAP allocation can be set per NOMAP irq_domain.
There is exactly one user of irq_virq_count in-tree right now: PS3.
Also, irq_virq_count is only useful for the NOMAP mapping. So,
instead of having a single global irq_virq_count value, this change
drops it entirely and adds a max_irq argument to irq_domain_add_nomap().
That makes it a property of an individual nomap irq domain instead of
a global system setting.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
kfifo_avail expands to __kfifo_max_r which is not an exported symbol.
Any kernel module using kfifo_avail will result in build failures
because of this.
This patch just exports the __kfifo_max_r symbol to fix such problems in
the future.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This tracer was temporarily removed in 6416669 (workqueue:
temporarily remove workqueue tracing, 2010-06-29) but never
reinstated after concurrency managed workqueues were completed.
For almost two years it hasn't been compilable so it seems nobody
is using it. Delete it.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With memcg converted, cgroup_subsys->populate() doesn't have any user
left. Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
keyctl_session_to_parent(task) sets ->replacement_session_keyring,
it should be processed and cleared by key_replace_session_keyring().
However, this task can fork before it notices TIF_NOTIFY_RESUME and
the new child gets the bogus ->replacement_session_keyring copied by
dup_task_struct(). This is obviously wrong and, if nothing else, this
leads to put_cred(already_freed_cred).
Change copy_creds() to clear this member. If copy_process() fails
before this point the wrong ->replacement_session_keyring doesn't
matter, exit_creds() won't be called.
Cc: <stable@vger.kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes the irq_domain_mapping debugfs output to pad pointer
values with leading zeros so that pointer values are displayed
correctly. Otherwise you get output similar to "0x 5e0000000000000".
Also, when the irq_domain is set to 'null'
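The padding difference described above can be illustrated with a small
user-space sketch. It assumes a 16-digit field width and uses plain
snprintf() on an unsigned long long rather than the kernel's seq_file
helpers; the function names are hypothetical and are not taken from the
patch.

```c
#include <stdio.h>

/*
 * Hypothetical helper: width 16 with the default pad character (space).
 * A value with fewer than 16 hex digits gets leading blanks, which is
 * how output such as "0x 5e0000000000000" comes about.
 */
void sketch_format_space_padded(char *buf, size_t len, unsigned long long v)
{
	snprintf(buf, len, "0x%16llx", v);
}

/*
 * Hypothetical helper: same width, but the '0' flag asks printf-style
 * formatting to pad with leading zeros instead of blanks.
 */
void sketch_format_zero_padded(char *buf, size_t len, unsigned long long v)
{
	snprintf(buf, len, "0x%016llx", v);
}
```

Calling both helpers with the value 0x5e shifted left by 52 bits (a
15-digit hex number) gives one blank after "0x" in the first case and a
'0' in the second.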
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: David Daney <david.daney@cavium.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
The actual name of the irq_domain mapping debugfs file is
"irq_domain_mapping" not "virq_mapping".
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
In commit 4bbdd45a (irq_domain/powerpc: eliminate irq_map; use
irq_alloc_desc() instead) code was added that ignores error returns
from irq_alloc_desc_from() by (silently) casting the return value to
unsigned. The negative value error return now suddenly looks like a
valid irq number.
Commits cc79ca69 (irq_domain: Move irq_domain code from powerpc to
kernel/irq) and 1bc04f2c (irq_domain: Add support for base irq and
hwirq in legacy mappings) move this code to its current location in
irqdomain.c.
The result of all of this is a null pointer dereference OOPS if one of
the error cases is hit.
The fix: Don't cast away the negativeness of the return value and then
check for errors.
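The bug pattern can be sketched in user-space C. Everything below is a
stand-in: sketch_alloc_desc() is a hypothetical allocator that returns a
negative errno on failure, and returning 0 for "no irq" is an assumption
made for the sketch, not a copy of the actual irqdomain.c code.

```c
#include <errno.h>

/* Hypothetical allocator: negative errno on failure, a descriptor number otherwise. */
static int sketch_alloc_desc(int fail)
{
	return fail ? -ENOMEM : 42;
}

/*
 * Buggy pattern: the signed return value is stored in an unsigned
 * variable, so -ENOMEM silently turns into a huge, valid-looking number.
 */
unsigned int sketch_map_buggy(int fail)
{
	unsigned int virq = sketch_alloc_desc(fail);

	return virq;
}

/*
 * Fixed pattern: keep the value signed, check for errors first, and only
 * then hand it on as an unsigned irq number (0 meaning "no irq" here).
 */
unsigned int sketch_map_fixed(int fail)
{
	int virq = sketch_alloc_desc(fail);

	if (virq <= 0)
		return 0;
	return (unsigned int)virq;
}
```

With a failing allocator, sketch_map_buggy() returns a large non-zero
value that a caller would treat as a usable irq, while sketch_map_fixed()
returns 0.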
Signed-off-by: David Daney <david.daney@cavium.com>
Acked-by: Rob Herring <rob.herring@calxeda.com>
[grant.likely: dropped addition of new 'irq' variable]
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
In the commit 77b0d60c5a,
"clockevents: Leave the broadcast device in shutdown mode when not needed",
we were bailing out too quickly in tick_broadcast_switch_to_oneshot(),
without tracking the broadcast device mode change to 'TICKDEV_MODE_ONESHOT'.
This breaks the platforms which need broadcast device oneshot services during
deep idle states. tick_broadcast_oneshot_control() thinks that it is
in periodic mode and fails to take proper decisions based on the
CLOCK_EVT_NOTIFY_BROADCAST_[ENTER, EXIT] notifications during deep
idle entry/exit.
Fix this by tracking the broadcast device mode as 'TICKDEV_MODE_ONESHOT',
before leaving the broadcast HW device in shutdown mode if there are no active
requests for the moment.
Reported-and-tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: johnstul@us.ibm.com
Link: http://lkml.kernel.org/r/1334011304.12400.81.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
David pointed out that using WARN_ONCE() to report usage of a deprecated
misfeature makes folks unhappy. Use printk_once() instead.
Andrew told me to stop grumbling and to remove the silly typecast
while touching the file.
Reported-by: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge with latest Linus' tree, as I have incoming patches
that fix code that is newer than current HEAD of for-next.
Conflicts:
drivers/net/ethernet/realtek/r8169.c
A suspended VM can cause spurious soft lockup warnings. To avoid these, the
watchdog now checks if the kernel knows it was stopped by the host and skips
the warning if so. When the watchdog is reset successfully, clear the guest
paused flag.
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Modify alloc_uid to take a kuid and make the user hash table global.
Stop holding a reference to the user namespace in struct user_struct.
This simplifies the code and makes the per user accounting not
care about which user namespace a uid happens to appear in.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Start distinguishing between internal kernel uids and gids and
values that userspace can use. This is done by introducing two
new types: kuid_t and kgid_t. These types and their associated
functions and infrastructure are declared in the new header
uidgid.h.
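The idea of a distinct kernel-internal uid type can be sketched by
wrapping the raw number in a struct, so the compiler rejects accidental
mixing with plain integers. The sketch_ names below are hypothetical and
the mapping is a trivial identity; this illustrates the technique and is
not the contents of uidgid.h.

```c
#include <stdbool.h>

/* Wrapping the value in a struct makes it a distinct type for the compiler. */
typedef struct {
	unsigned int val;
} sketch_kuid_t;

/* Hypothetical identity mapping from a userspace-visible uid to the internal type. */
sketch_kuid_t sketch_make_kuid(unsigned int uid)
{
	sketch_kuid_t kuid = { .val = uid };

	return kuid;
}

/* Hypothetical reverse mapping back to a value userspace can use. */
unsigned int sketch_from_kuid(sketch_kuid_t kuid)
{
	return kuid.val;
}

/* Comparisons go through a helper because struct values cannot be compared with ==. */
bool sketch_uid_eq(sketch_kuid_t left, sketch_kuid_t right)
{
	return left.val == right.val;
}
```

With this shape, passing a bare integer where a sketch_kuid_t is
expected is a compile error, which is what makes the mechanical
conversions checkable.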
Ultimately there will be a different implementation of the mapping
functions for use with user namespaces. But to keep it simple
we introduce the mapping functions first to separate the meat
from the mechanical code conversions.
Export overflowuid and overflowgid so we can use from_kuid_munged
and from_kgid_munged in modular code.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This represents a change in strategy of how to handle user namespaces.
Instead of tagging everything explicitly with a user namespace and bulking
up all of the comparisons of uids and gids in the kernel, all uids and gids
in use will have a mapping to flat kuid and kgid spaces respectively. This
allows much more of the existing logic to be preserved and in general
allows for faster code.
In this new and improved world we allow someone to utilize capabilities
over an inode if the inode's owner maps into the capability holder's user
namespace and the user has capabilities in their user namespace. Which
is simple and efficient.
Moving the fs uid comparisons to be comparisons in a flat kuid space
follows in later patches, something that is only significant if you
are using user namespaces.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
With a user_ns reference in struct cred the only user of the user namespace
reference in struct user_struct is to keep the uid hash table alive.
The user_namespace reference in struct user_struct will be going away soon, and
I have removed all of the references. Rename the field from user_ns to _user_ns
so that the compiler can verify nothing follows the user struct to the user
namespace anymore.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
I am about to remove the struct user_namespace reference from struct user_struct.
So keep explicit track of the parent user namespace.
Take advantage of this new reference and replace instances of user_ns->creator->user_ns
with user_ns->parent.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
struct user_struct will shortly lose its user_ns reference
so make the cred user_ns reference a proper reference complete
with reference counting.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Optimize performance and prepare for the removal of the user_ns reference
from user_struct. Remove the slow long walk through cred->user->user_ns and
instead go straight to cred->user_ns.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
In struct cred the user member is and has always been declared struct user_struct *user.
At most a constant struct cred will have a constant pointer to non-constant user_struct
so remove this unnecessary cast.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Pull security layer fixlet from James Morris.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
sysctl: fix write access to dmesg_restrict/kptr_restrict
Fix tick_nohz_restart() to not use a stale ktime_t "now" value when
calling tick_do_update_jiffies64(now).
If we reach this point in the loop it means that we crossed a tick
boundary since we grabbed the "now" timestamp, so at this point "now"
refers to a time in the old jiffy, so using the old value for "now" is
incorrect, and is likely to give us a stale jiffies value.
In particular, the first time through the loop the
tick_do_update_jiffies64(now) call is always a no-op, since the
caller, tick_nohz_restart_sched_tick(), will have already called
tick_do_update_jiffies64(now) with that "now" value.
Note that tick_nohz_stop_sched_tick() already uses the correct
approach: when we notice we cross a jiffy boundary, grab a new
timestamp with ktime_get(), and *then* update jiffies.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1332875377-23014-1-git-send-email-ncardwell@google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge batch of fixes from Andrew Morton:
"The simple_open() cleanup was held back while I waited for laggards to
merge things.
I still need to send a few checkpoint/restore patches. I've been
wobbly about merging them because I'm wobbly about the overall
prospects for success of the project. But after speaking with Pavel
at the LSF conference, it sounds like they're further toward
completion than I feared - apparently davem is at the "has stopped
complaining" stage regarding the net changes. So I need to go back
and re-review those patches and their (lengthy) discussion."
* emailed from Andrew Morton <akpm@linux-foundation.org>: (16 patches)
memcg swap: use mem_cgroup_uncharge_swap fix
backlight: add driver for DA9052/53 PMIC v1
C6X: use set_current_blocked() and block_sigmask()
MAINTAINERS: add entry for sparse checker
MAINTAINERS: fix REMOTEPROC F: typo
alpha: use set_current_blocked() and block_sigmask()
simple_open: automatically convert to simple_open()
scripts/coccinelle/api/simple_open.cocci: semantic patch for simple_open()
libfs: add simple_open()
hugetlbfs: remove unregister_filesystem() when initializing module
drivers/rtc/rtc-88pm860x.c: fix rtc irq enable callback
fs/xattr.c:setxattr(): improve handling of allocation failures
fs/xattr.c:listxattr(): fall back to vmalloc() if kmalloc() failed
fs/xattr.c: suppress page allocation failure warnings from sys_listxattr()
sysrq: use SEND_SIG_FORCED instead of force_sig()
proc: fix mount -t proc -o AAA
Many users of debugfs copy the implementation of default_open() when
they want to support a custom read/write function op. This leads to a
proliferation of the default_open() implementation across the entire
tree.
Now that the common implementation has been consolidated into libfs we
can replace all the users of this function with simple_open().
This replacement was done with the following semantic patch:
<smpl>
@ open @
identifier open_f != simple_open;
identifier i, f;
@@
-int open_f(struct inode *i, struct file *f)
-{
(
-if (i->i_private)
-f->private_data = i->i_private;
|
-f->private_data = i->i_private;
)
-return 0;
-}
@ has_open depends on open @
identifier fops;
identifier open.open_f;
@@
struct file_operations fops = {
...
-.open = open_f,
+.open = simple_open,
...
};
</smpl>
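For readers unfamiliar with the helper, the function body that the
semantic patch matches and removes can be sketched in user-space C. The
mock_inode and mock_file structures are hypothetical stand-ins holding
only the two fields involved, and simple_open_sketch() mirrors the first
alternative of the pattern above; it is not a copy of the libfs source.

```c
#include <stddef.h>

/* Mock stand-ins for the kernel structures; only the fields used here. */
struct mock_inode {
	void *i_private;
};

struct mock_file {
	void *private_data;
};

/*
 * Copy the per-inode private pointer into the open file, if one is set,
 * and report success. This is the shape the semantic patch looks for.
 */
int simple_open_sketch(struct mock_inode *inode, struct mock_file *file)
{
	if (inode->i_private)
		file->private_data = inode->i_private;
	return 0;
}
```

A driver whose open callback has exactly this body gains nothing by
keeping a private copy, which is why the conversion is mechanical.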
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit bfdc0b4 adds code to restrict access to dmesg_restrict,
however, it incorrectly alters kptr_restrict rather than
dmesg_restrict.
The original patch from Richard Weinberger
(https://lkml.org/lkml/2011/3/14/362) alters dmesg_restrict as
expected, and so the patch seems to have been misapplied.
This adds the CAP_SYS_ADMIN check to both dmesg_restrict and
kptr_restrict, since both are sensitive.
Reported-by: Phillip Lougher <plougher@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
3.4: Fix a Smatch warning that appeared in the 3.4 merge window
3.0: Fix kgdb test suite with SMP for all archs without HW single stepping
2.6.36: Fix kgdb sw breakpoints with CONFIG_DEBUG_RODATA=y limitations on x86
2.6.26: Fix oops on kgdb test suite with CONFIG_DEBUG_RODATA
Fix kgdb test suite with SMP for all archs with HW single stepping
Merge tag 'for_linus-3.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull KGDB/KDB regression fixes from Jason Wessel:
- Fix a Smatch warning that appeared in the 3.4 merge window
- Fix kgdb test suite with SMP for all archs without HW single stepping
- Fix kgdb sw breakpoints with CONFIG_DEBUG_RODATA=y limitations on x86
- Fix oops on kgdb test suite with CONFIG_DEBUG_RODATA
- Fix kgdb test suite with SMP for all archs with HW single stepping
* tag 'for_linus-3.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
x86,kgdb: Fix DEBUG_RODATA limitation using text_poke()
kgdb,debug_core: pass the breakpoint struct instead of address and memory
kgdbts: (2 of 2) fix single step awareness to work correctly with SMP
kgdbts: (1 of 2) fix single step awareness to work correctly with SMP
kgdbts: Fix kernel oops with CONFIG_DEBUG_RODATA
kdb: Fix smatch warning on dbg_io_ops->is_console
Fixes mostly, including:
* Patch series that hopefully fixes races between the freezer and request_firmware()
and request_firmware_nowait() for good, with two cleanups from Stephen Boyd on top.
* Runtime PM fix from Alan Stern preventing tasks from getting stuck indefinitely
in the runtime PM wait queue.
* Device PM QoS update from MyungJoo Ham introducing a new variant of
pm_qos_update_request() allowing the callers to specify a timeout.
Merge tag 'pm-for-3.4-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
- Patch series that hopefully fixes races between the freezer and
request_firmware() and request_firmware_nowait() for good, with two
cleanups from Stephen Boyd on top.
- Runtime PM fix from Alan Stern preventing tasks from getting stuck
indefinitely in the runtime PM wait queue.
- Device PM QoS update from MyungJoo Ham introducing a new variant of
pm_qos_update_request() allowing the callers to specify a timeout.
* tag 'pm-for-3.4-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / QoS: add pm_qos_update_request_timeout() API
firmware_class: Move request_firmware_nowait() to workqueues
firmware_class: Reorganize fw_create_instance()
PM / Sleep: Mitigate race between the freezer and request_firmware()
PM / Sleep: Move disabling of usermode helpers to the freezer
PM / Hibernate: Disable usermode helpers right before freezing tasks
firmware_class: Do not warn that system is not ready from async loads
firmware_class: Split _request_firmware() into three functions, v2
firmware_class: Rework usermodehelper check
PM / Runtime: don't forget to wake up waitqueue on failure
This merges some of the fixes from Paul Gortmaker for the header file
cleanup fallout.
Some of the patches are going through arch maintainer trees, and David
Howells suggested another be done differently, but this at least fixes a
few cases.
* emailed from Paul Gortmaker <paul.gortmaker@windriver.com>:
asm-generic: add linux/types.h to cmpxchg.h
firewire: restore the device.h include in linux/firewire.h
frv: fix warnings in mb93090-mb00/pci-dma.c about implicit EXPORT_SYMBOL
parisc: fix missing cmpxchg file error from system.h split
blackfin: fix cmpxchg build fails from system.h fallout
avr32: fix build failures from mis-naming of atmel_nand.h
ARM: mach-msm: fix compile fail from system.h fallout
irq_work: fix compile failure on MIPS from system.h split
Pull crypto fixes from Herbert Xu:
- Fix for CPU hotplug hang in padata.
- Avoid using cpu_active inappropriately in pcrypt and padata.
- Fix for user-space algorithm lookup hang with IV generators.
- Fix for netlink dump of algorithms where stuff went missing due to
incorrect calculation of message size.
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: user - Fix size of netlink dump message
crypto: user - Fix lookup of algorithms with IV generator
crypto: pcrypt - Use the online cpumask as the default
padata: Fix cpu hotplug
padata: Use the online cpumask as the default
padata: Add a reference to the api documentation
Merge tag 'for-linus' of git://github.com/rustyrussell/linux
Pull cpumask cleanups from Rusty Russell:
"(Somehow forgot to send this out; it's been sitting in linux-next, and
if you don't want it, it can sit there another cycle)"
I'm a sucker for things that actually delete lines of code.
Fix up trivial conflict in arch/arm/kernel/kprobes.c, where Rusty fixed
a user of &cpu_online_map to be cpu_online_mask, but that code got
deleted by commit b21d55e98a ("ARM: 7332/1: extract out code patch
function from kprobes").
* tag 'for-linus' of git://github.com/rustyrussell/linux:
cpumask: remove old cpu_*_map.
documentation: remove references to cpu_*_map.
drivers/cpufreq/db8500-cpufreq: remove references to cpu_*_map.
remove references to cpu_*_map in arch/
Builds of the MIPS platform ip32_defconfig fail as of commit
0195c00244 ("Merge tag 'split-asm_system_h ...") because MIPS xchg()
macro uses BUILD_BUG_ON and it was moved in commit b81947c646
("Disintegrate asm/system.h for MIPS").
The root cause is that the system.h split wasn't tested on a baseline
with commit 6c03438ede ("kernel.h: doesn't explicitly use bug.h, so
don't include it.")
Since this file uses BUG code in several other places besides the xchg
call, simply make the inclusion explicit.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroup/for-3.5 contains the following changes which blk-cgroup needs
to proceed with the on-going cleanup.
* Dynamic addition and removal of cftypes to make config/stat file
handling modular for policies.
* cgroup removal update to not wait for css references to drain to fix
blkcg removal hang caused by cfq caching cfqgs.
Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the
following conflicts in block/blk-cgroup.c.
* 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
removal. Resolved by removing @subsys from all subsys methods.
* 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
controllers" conflicts with ->pre_destroy() and ->attach() updates
and removal of modular config. Resolved by dropping forward
declarations of the methods and applying updates to the relocated
blkio_subsys.
* 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
cftype interface" builds upon the previous item. Resolved by adding
->base_cftypes to the relocated blkio_subsys.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, cgroup removal tries to drain all css references. If there
are active css references, the removal logic waits and retries
->pre_destroy() until either all refs drop to zero or removal is
cancelled.
These semantics are unusual, add non-trivial complexity to cgroup
core and IMHO are fundamentally misguided in that they couple internal
implementation details (references to internal data structure) with
externally visible operation (rmdir). To userland, this is a behavior
peculiarity which is unnecessary and difficult to anticipate (css refs are
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immediately switched to the new
behavior. This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch. Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiple css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
When a cgroup is about to be removed, cgroup_clear_css_refs() is
called to check and ensure that there are no active css references.
This is currently achieved by dropping the refcnt to zero iff it has
only the base ref. If all css refs could be dropped to zero, ref
clearing is successful and CSS_REMOVED is set on all css. If not, the
base ref is restored. While css ref is zero w/o CSS_REMOVED set, any
css_tryget() attempt on it busy loops so that they are atomic
w.r.t. the whole css ref clearing.
This does work but dropping and re-instating the base ref is somewhat
hairy and makes it difficult to add more logic to the put path as
there are two of them - the regular css_put() and the reversible base
ref clearing.
This patch updates css ref clearing such that blocking new
css_tryget() and putting the base ref are separate operations.
CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
css_tryget() busy loops while refcnt is negative. After all css refs
are deactivated, if they were all one, ref clearing succeeded and
CSS_REMOVED is set and the base ref is put using the regular
css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
and the original positive values are restored.
css_refcnt() accessor which always returns the unbiased positive
reference counts is added and used to simplify refcnt usages. While
at it, relocate and reformat comments in cgroup_has_css_refs().
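The bias scheme can be modelled in user space with C11 atomics. This is
a simplified sketch under stated assumptions: the sketch_ names are
hypothetical, SKETCH_DEACT_BIAS follows the INT_MIN definition given
above, and the tryget helper makes a single attempt and reports failure
on a deactivated counter instead of busy looping as the kernel code
would.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_DEACT_BIAS INT_MIN	/* same value as described for CSS_DEACT_BIAS */

struct sketch_ref {
	atomic_int cnt;
};

/* Return the unbiased, non-negative count whether or not the bias is applied. */
int sketch_refcnt(struct sketch_ref *r)
{
	int v = atomic_load(&r->cnt);

	return v >= 0 ? v : v - SKETCH_DEACT_BIAS;
}

/* Add the bias: the stored value goes negative, which blocks new tryget attempts. */
void sketch_deactivate(struct sketch_ref *r)
{
	atomic_fetch_add(&r->cnt, SKETCH_DEACT_BIAS);
}

/* Subtract the bias again: the original positive value is restored. */
void sketch_reactivate(struct sketch_ref *r)
{
	atomic_fetch_sub(&r->cnt, SKETCH_DEACT_BIAS);
}

/*
 * Single tryget attempt. Fails while the counter is deactivated; the
 * real code would keep retrying until removal is confirmed or cancelled.
 */
bool sketch_tryget_once(struct sketch_ref *r)
{
	int v = atomic_load(&r->cnt);

	if (v < 0)
		return false;
	return atomic_compare_exchange_strong(&r->cnt, &v, v + 1);
}
```

Because deactivation only adds a constant, cancelling removal needs no
special put path: subtracting the same constant brings every counter
back to its previous value.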
This separates css->refcnt deactivation and putting the base ref,
which enables the next patch to make ref clearing optional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Implement cgroup_rm_cftypes() which removes an array of cftypes from a
subsystem. It can be called whether the target subsys is attached or
not. cgroup core will remove the specified file from all existing
cgroups.
This will be used to improve sub-subsys modularity and will be helpful
for unified hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
This patch adds cfent (cgroup file entry) which is the association
between a cgroup and a file. This is the in-cgroup representation of
files under a cgroup directory. This simplifies walking
cgroup files and thus cgroup_clear_directory(), which is now
implemented in two parts - cgroup_rm_file() and a loop around it.
cgroup_rm_file() will be used to implement cftype removal and cfent is
scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
"sever" semantics).
v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
if the file had openers at the time of removal. Moved to
cgroup_diput().
- cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
wasn't empty after removing all files. This triggered
spuriously if some files were open during directory clearing.
Removed.
v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
spuriously triggered for root cgroups because they don't go
through cgroup_clear_directory() on unmount. Don't trigger WARN
for root cgroups.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
No controller is using cgroup_add_file[s](). Unexport them, and
convert cgroup_add_files() to handle NULL entry terminated array
instead of taking count explicitly and continue creation on failure
for internal use.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.
This is a functionally identical transformation. There shouldn't be any
visible behavior change.
memcg is rather special and will be converted separately.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
Now that cftype can express whether a file should only be on root,
cft_release_agent can be merged into the base files cftypes array.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Currently, cgroup directories are populated by subsys->populate()
callback explicitly creating files on each cgroup creation. This
level of flexibility isn't needed or desirable. It provides largely
unused flexibility which invites abuse while severely limiting what
the core layer can do through the lack of structure and conventions.
Per each cgroup file type, the only distinction that cgroup users are
making is whether a cgroup is root or not, which can easily be
expressed with flags.
This patch introduces cgroup_add_cftypes(). It deals with cftypes
instead of individual files - controllers indicate that certain types
of files exist for a certain subsystem. Newly added CFTYPE_*_ON_ROOT
flags indicate whether a cftype should be excluded or created only on
the root cgroup.
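How a flag-driven file table could be walked is sketched below in
user-space C. All names are hypothetical: the SK_ flags stand in for the
CFTYPE_*_ON_ROOT flags mentioned above, and ending the array with an
entry whose name is NULL is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_ONLY_ON_ROOT	(1U << 0)	/* create only on the root cgroup */
#define SK_NOT_ON_ROOT	(1U << 1)	/* create everywhere except the root */

struct sk_cftype {
	const char *name;	/* NULL marks the end of the array */
	unsigned int flags;
};

/* Decide from the flags alone whether this file type belongs on a given cgroup. */
bool sk_should_create(const struct sk_cftype *cft, bool is_root)
{
	if ((cft->flags & SK_ONLY_ON_ROOT) && !is_root)
		return false;
	if ((cft->flags & SK_NOT_ON_ROOT) && is_root)
		return false;
	return true;
}

/* Walk the terminated array and count the files that would be created. */
size_t sk_count_files(const struct sk_cftype *cfts, bool is_root)
{
	size_t n = 0;

	for (const struct sk_cftype *cft = cfts; cft->name; cft++)
		if (sk_should_create(cft, is_root))
			n++;
	return n;
}
```

Since the decision depends only on data in the table, a core layer can
apply the same table to cgroups that already exist, without calling back
into the controller.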
cgroup_add_cftypes() can be called any time whether the target
subsystem is currently attached or not. cgroup core will create files
on the existing cgroups as necessary.
Also, cgroup_subsys->base_cftypes is added to ease registration of the
base files for the subsystem. If non-NULL on subsys init, the cftypes
pointed to by ->base_cftypes are automatically registered on subsys
init / load.
Further patches will convert the existing users and remove the file
based interface. Note that this interface allows dynamic addition of
files to an active controller. This will be used for sub-controller
modularity and unified hierarchy in the longer term.
This patch implements the new mechanism but doesn't apply it to any
user.
v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
cgroup_subsys->base_cftypes, which works better for cgroup_subsys
which is loaded as module.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
going through cgroup->allcg_node. The list is protected by
cgroup_mutex and will be used to improve cgroup file handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
cgroup_populate_dir() currently clears all files and then repopulates
the directory; however, the clearing part is only useful when it's
called from cgroup_remount(). Relocate the invocation to
cgroup_remount().
This is to prepare for further cgroup file handling updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
This patch marks the following features for deprecation.
* Rebinding subsys by remount: Never reached a useful state - only works
on empty hierarchies.
* release_agent update by remount: release_agent itself will be
replaced with conventional fsnotify notification.
v2: Lennart pointed out that "name=" is necessary for mounts w/o any
controller attached. Drop "name=" deprecation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix incorrect usage of for_each_cpu_mask() in select_fallback_rq()
sched: Fix __schedule_bug() output when called from an interrupt
sched/arch: Introduce the finish_arch_post_lock_switch() scheduler callback
Pull perf updates and fixes from Ingo Molnar:
"It's mostly fixes, but there's also two late items:
- preliminary GTK GUI support for perf report
- PMU raw event format descriptors in sysfs, to be parsed by tooling
The raw event format in sysfs is a new ABI. For example for the 'CPU'
PMU we have:
aldebaran:~> ll /sys/bus/event_source/devices/cpu/format/*
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/any
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/cmask
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/edge
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/event
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/inv
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/offcore_rsp
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/pc
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/umask
those lists of fields contain a specific format:
aldebaran:~> cat /sys/bus/event_source/devices/cpu/format/offcore_rsp
config1:0-63
So, those who wish to specify raw events can now use the following
event format:
-e cpu/cmask=1,event=2,umask=3
Most people will not want to specify any events (let alone raw
events), they'll just use whatever default event the tools use.
But for more obscure PMU events that have no cross-architecture
generic events the above syntax is more usable and a bit more
structured than specifying hex numbers."
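As an illustration of how tooling might consume such a format descriptor, here is a minimal userspace sketch that parses one line such as "config1:0-63". The struct and function names are hypothetical, and the single-bit form ("name:bit") is an assumption not shown in the message above; this is not the perf tool's actual parser.

```c
#include <stdio.h>

/* Hypothetical holder for one parsed descriptor line. */
struct toy_format_field {
	char config[16];	/* e.g. "config" or "config1" */
	int lo, hi;		/* inclusive bit range within that config word */
};

/* Parse "name:lo-hi" (or, as an assumption, "name:bit"); 0 on success. */
static int toy_parse_format(const char *s, struct toy_format_field *f)
{
	int n = sscanf(s, "%15[^:]:%d-%d", f->config, &f->lo, &f->hi);

	if (n == 3)
		return f->lo <= f->hi ? 0 : -1;
	if (n == 2) {
		f->hi = f->lo;	/* assumed single-bit field */
		return 0;
	}
	return -1;		/* no name, no colon, or no number */
}
```

For the offcore_rsp example above, this would yield config "config1" with bits 0 through 63.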
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
perf tools: Remove auto-generated bison/flex files
perf annotate: Fix off by one symbol hist size allocation and hit accounting
perf tools: Add missing ref-cycles event back to event parser
perf annotate: addr2line wants addresses in same format as objdump
perf probe: Finder fails to resolve function name to address
tracing: Fix ent_size in trace output
perf symbols: Handle NULL dso in dso__name_len
perf symbols: Do not include libgen.h
perf tools: Fix bug in raw sample parsing
perf tools: Fix display of first level of callchains
perf tools: Switch module.h into export.h
perf: Move mmap page data_head offset assertion out of header
perf: Fix mmap_page capabilities and docs
perf diff: Fix to work with new hists design
perf tools: Fix modifier to be applied on correct events
perf tools: Fix various casting issues for 32 bits
perf tools: Simplify event_read_id exit path
tracing: Fix ftrace stack trace entries
tracing: Move the tracing_on/off() declarations into CONFIG_TRACING
perf report: Add a simple GTK2-based 'perf report' browser
...
This option has been selected from arch code as it was assumed that
it's necessary to support oneshot mode clockevent devices. But it's
just a core internal helper to compile tick-oneshot.c if NOHZ or
HIGH_RES_TIMERS are selected.
Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Maintain a per-mm counter: number of uprobes that are inserted
on this process address space.
This counter can be used at probe hit time to determine if we
need a lookup in the uprobes rbtree. Every time a probe gets
inserted successfully, the probe count is incremented and
every time a probe gets removed, the probe count is decremented.
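The idea of gating the rbtree lookup on a per-mm counter can be sketched in plain C as follows. The struct and helper names are hypothetical stand-ins, not the kernel's, and locking is ignored entirely.

```c
#include <stdbool.h>

/* Hypothetical, minimal stand-in for the per-mm state. */
struct toy_mm {
	int uprobes_count;	/* probes currently inserted in this address space */
};

/* Called after a probe was inserted successfully. */
static void toy_probe_inserted(struct toy_mm *mm)
{
	mm->uprobes_count++;
}

/* Called after a probe was removed. */
static void toy_probe_removed(struct toy_mm *mm)
{
	mm->uprobes_count--;
}

/* At hit time: only pay for the tree lookup if a probe can exist at all. */
static bool toy_need_lookup(const struct toy_mm *mm)
{
	return mm->uprobes_count > 0;
}
```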
The new uprobe_munmap hook ensures the count is correct on an
unmap or remap of a region. We expect that once
uprobe_munmap() is called, the vma goes away. So
uprobe_unregister() finding a probe to unregister would either
mean the unmap event hasn't occurred yet or an mmap event on the same
executable file occurred after an unmap event.
Additionally, uprobe_mmap hook now also gets called:
a. on every executable vma that is COWed at fork.
b. a vma of interest is newly mapped; breakpoint insertion also
happens at the required address.
On process creation, make sure the probes count in the child is
set correctly.
Special cases that are taken care of include:
a. mremap
b. VM_DONTCOPY vmas on fork()
c. insertion/removal races in the parent during fork().
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120330182646.10018.85805.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Uprobes executes the original instruction at a probed location
out of line. For this, we allocate a page (per mm) upon the
first uprobe hit, in the process user address space, divide it
into slots that are used to store the actual instructions to be
singlestepped. These slots are known as xol (execution out of
line) slots.
Care is taken to ensure that the allocation is in an unmapped
area as close to the top of the user address space as possible,
with appropriate permission settings to keep selinux like
frameworks happy.
Upon a uprobe hit, a free slot is acquired, and is released
after the singlestep completes.
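A rough userspace model of the slot bookkeeping described above: one page split into fixed-size slots, tracked by a bitmap. The slot size, the names, and the choice to return 0 when the page is full are assumptions made for illustration; the real code would have to deal with waiting for a free slot and with concurrency.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE	4096u
#define TOY_SLOT_BYTES	128u	/* assumed slot size, illustration only */
#define TOY_NR_SLOTS	(TOY_PAGE_SIZE / TOY_SLOT_BYTES)	/* 32 slots */

/* Hypothetical per-mm area: base address of the page plus a use bitmap. */
struct toy_xol_area {
	unsigned long base;
	uint32_t used;		/* bit i set means slot i is taken */
};

/* Acquire the first free slot; returns its address, or 0 if none is free. */
static unsigned long toy_xol_take_slot(struct toy_xol_area *area)
{
	for (unsigned int i = 0; i < TOY_NR_SLOTS; i++) {
		if (!(area->used & (UINT32_C(1) << i))) {
			area->used |= UINT32_C(1) << i;
			return area->base + (unsigned long)i * TOY_SLOT_BYTES;
		}
	}
	return 0;
}

/* Release a slot after the single-step has completed. */
static void toy_xol_free_slot(struct toy_xol_area *area, unsigned long addr)
{
	unsigned long i = (addr - area->base) / TOY_SLOT_BYTES;

	area->used &= ~(UINT32_C(1) << i);
}
```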
Lots of improvements courtesy suggestions/inputs from Peter and
Oleg.
[ Folded a fix for build issue on powerpc fixed and reported by
Stephen Rothwell. ]
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120330182631.10018.48175.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The function for_each_cpu_mask() expects a *pointer* to struct
cpumask as its second argument, whereas select_fallback_rq()
passes the value itself.
And moreover, for_each_cpu_mask() has been marked as obsolete
in include/linux/cpumask.h. So move to the more appropriate
for_each_cpu() variant.
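To make the pointer-versus-value point concrete, here is a toy userspace iterator shaped like for_each_cpu(). Every name below is a hypothetical stand-in for the kernel's cpumask API; the only point is that the macro's second argument is a pointer to the mask.

```c
/* Toy stand-in for struct cpumask: one bit per CPU. */
#define TOY_NR_CPUS 8

struct toy_cpumask {
	unsigned long bits;
};

/* Return the first set CPU strictly after 'prev', or TOY_NR_CPUS if none. */
static int toy_cpumask_next(int prev, const struct toy_cpumask *mask)
{
	for (int cpu = prev + 1; cpu < TOY_NR_CPUS; cpu++)
		if (mask->bits & (1UL << cpu))
			return cpu;
	return TOY_NR_CPUS;
}

/* Shaped like for_each_cpu(): 'maskp' must be a pointer, not a struct value. */
#define toy_for_each_cpu(cpu, maskp)				\
	for ((cpu) = toy_cpumask_next(-1, (maskp));		\
	     (cpu) < TOY_NR_CPUS;				\
	     (cpu) = toy_cpumask_next((cpu), (maskp)))

/* Count the CPUs in a mask by walking it with the iterator. */
static int toy_count_cpus(const struct toy_cpumask *mask)
{
	int cpu, n = 0;

	toy_for_each_cpu(cpu, mask)
		n++;
	return n;
}
```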
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Dave Jones <davej@redhat.com>
Cc: Liu Chuansheng <chuansheng.liu@intel.com>
Cc: vapier@gentoo.org
Cc: rusty@rustcorp.com.au
Link: http://lkml.kernel.org/r/4F75BED4.9050005@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull genirq updates from Thomas Gleixner.
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Adjust irq thread affinity on IRQ_SET_MASK_OK_NOCOPY return value
genirq: Respect NUMA node affinity in setup_irq_irq affinity()
genirq: Get rid of unneeded force parameter in irq_finalize_oneshot()
genirq: Minor readablity improvement in irq_wake_thread()
Pull core locking updates from Thomas Gleixner.
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Mark get_robust_list as deprecated
futex: Do not leak robust list to unprivileged process
irq_move_masked_irq() checks the return code of
chip->irq_set_affinity() only for 0, but IRQ_SET_MASK_OK_NOCOPY is
also a valid return code, which is there to avoid a redundant copy of
the cpumask. But in case of IRQ_SET_MASK_OK_NOCOPY we not only avoid
the redundant copy, we also fail to adjust the thread affinity of an
eventually threaded interrupt handler.
Handle IRQ_SET_MASK_OK (==0) and IRQ_SET_MASK_OK_NOCOPY(==1) return
values correctly by checking the valid return values separately.
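The shape of such a check can be sketched in userspace C. The two constants use the values stated above; the descriptor struct, its fields, and the helper name are hypothetical simplifications rather than the genirq code.

```c
#include <stdbool.h>

/* Values as given in the changelog above. */
enum {
	IRQ_SET_MASK_OK = 0,
	IRQ_SET_MASK_OK_NOCOPY = 1,
};

/* Hypothetical, simplified descriptor: just enough state for the sketch. */
struct toy_irq {
	unsigned long affinity;		/* stored copy of the cpumask */
	bool thread_affinity_pending;	/* threaded handler must follow the move */
};

/*
 * Handle the result of a chip's set-affinity callback. Both success codes
 * flag the thread affinity; only IRQ_SET_MASK_OK copies the mask.
 */
static int toy_handle_set_affinity(struct toy_irq *irq, unsigned long mask,
				   int ret)
{
	switch (ret) {
	case IRQ_SET_MASK_OK:
		irq->affinity = mask;
		/* fall through */
	case IRQ_SET_MASK_OK_NOCOPY:
		irq->thread_affinity_pending = true;
		ret = 0;
		break;
	default:
		break;	/* error code: leave the state untouched */
	}
	return ret;
}
```

Using a switch keeps the two success codes visibly distinct from every error value, which a plain test against zero does not.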
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Keping Chen <chenkeping@huawei.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1333120296-13563-2-git-send-email-jiang.liu@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
setitimer() should return -EFAULT if called with an invalid pointer
for value. The current code excludes a NULL pointer from this rule and
silently uses it to stop the timer. This violates the spec.
Warn about user space apps which rely on that feature and schedule it
for removal.
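For user space code that currently relies on the NULL behaviour, a portable way to stop the timer is to pass a zeroed struct itimerval. A minimal sketch (the helper name is made up for illustration):

```c
#include <string.h>
#include <sys/time.h>

/* Disarm ITIMER_REAL by passing a zeroed itimerval instead of NULL. */
static int stop_real_timer(void)
{
	struct itimerval zero;

	memset(&zero, 0, sizeof(zero));
	return setitimer(ITIMER_REAL, &zero, NULL);
}
```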
[ tglx: Massaged changelog, warn message and Doc entry ]
Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
Link: http://lkml.kernel.org/r/1332340854-26053-1-git-send-email-sasikanth.v19@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull urgent cgroup fix from Tejun Heo:
"Commit 61d1d219c4 ('cgroup: remove extra calls to
find_existing_css_set') which was part of the rc1 cgroup pull request
made writes to the cgroup "tasks" file return an uninitialized retval
on success which can cause boot failures with systemd.
The change stayed in linux-next for quite some time but gcc
interestingly failed to emit warning about using uninitialized
variable and the problem seems to materialize only for certain build
combinations (probably depends on register allocation).
It's just missing local variable initialization and the fix is trivial
& safe. As the problem is critical when it materializes, I'm
fast-tracking it. Also included is Li's email address change in
MAINTAINERS."
* 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: cgroup_attach_task() could return -errno after success
cgroup: update MAINTAINERS entry
61d1d219c4 "cgroup: remove extra calls to find_existing_css_set" made
cgroup_task_migrate() return void. An unfortunate side effect was
that cgroup_attach_task() was depending on that function's return
value to clear its @retval on the success path. On cgroup mounts
without any subsystem with ->can_attach() callback,
cgroup_attach_task() ended up returning @retval without initializing
it on success.
For some reason, gcc failed to warn about it and it didn't cause
cgroup_attach_task() to return non-zero value in many cases, probably
due to difference in register allocation. When the problem
materializes, systemd fails to populate /systemd cgroup mount and
fails to boot.
Fix it by initializing @retval to zero on declaration.
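The failure pattern and its fix can be reduced to a few lines. The sketch below is a hypothetical simplification of the control flow, not the cgroup code: the callback type and function names are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for an optional ->can_attach() callback. */
typedef int (*toy_can_attach_fn)(void);

/*
 * 'retval' is only assigned when a callback exists, so it must start at
 * zero for the success path that has no callback at all.
 */
static int toy_attach_task(toy_can_attach_fn can_attach)
{
	int retval = 0;	/* the fix: initialize on declaration */

	if (can_attach) {
		retval = can_attach();
		if (retval)
			return retval;	/* callback vetoed the attach */
	}

	/* The migration step returns void, so nothing resets retval here. */
	return retval;
}

/* Two trivial callbacks for exercising both paths. */
static int toy_allow(void) { return 0; }
static int toy_deny(void) { return -1; }
```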
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiri Kosina <jkosina@suse.cz>
LKML-Reference: <alpine.LNX.2.00.1203282354440.25526@pobox.suse.cz>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Pull x32 support for x86-64 from Ingo Molnar:
"This tree introduces the X32 binary format and execution mode for x86:
32-bit data space binaries using 64-bit instructions and 64-bit kernel
syscalls.
This allows applications whose working set fits into a 32 bits address
space to make use of 64-bit instructions while using a 32-bit address
space with shorter pointers, more compressed data structures, etc."
Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}
* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
x32: Fix alignment fail in struct compat_siginfo
x32: Fix stupid ia32/x32 inversion in the siginfo format
x32: Add ptrace for x32
x32: Switch to a 64-bit clock_t
x32: Provide separate is_ia32_task() and is_x32_task() predicates
x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
x86/x32: Fix the binutils auto-detect
x32: Warn and disable rather than error if binutils too old
x32: Only clear TIF_X32 flag once
x32: Make sure TS_COMPAT is cleared for x32 tasks
fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
fs: Fix close_on_exec pointer in alloc_fdtable
x32: Drop non-__vdso weak symbols from the x32 VDSO
x32: Fix coding style violations in the x32 VDSO code
x32: Add x32 VDSO support
x32: Allow x32 to be configured
x32: If configured, add x32 system calls to system call tables
x32: Handle process creation
x32: Signal-related system calls
x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
...
Pull more ARM updates from Russell King.
This got a fair number of conflicts with the <asm/system.h> split, but
also with some other sparse-irq and header file include cleanups. They
all looked pretty trivial, though.
* 'for-linus' of git://git.linaro.org/people/rmk/linux-arm: (59 commits)
ARM: fix Kconfig warning for HAVE_BPF_JIT
ARM: 7361/1: provide XIP_VIRT_ADDR for no-MMU builds
ARM: 7349/1: integrator: convert to sparse irqs
ARM: 7259/3: net: JIT compiler for packet filters
ARM: 7334/1: add jump label support
ARM: 7333/2: jump label: detect %c support for ARM
ARM: 7338/1: add support for early console output via semihosting
ARM: use set_current_blocked() and block_sigmask()
ARM: exec: remove redundant set_fs(USER_DS)
ARM: 7332/1: extract out code patch function from kprobes
ARM: 7331/1: extract out insn generation code from ftrace
ARM: 7330/1: ftrace: use canonical Thumb-2 wide instruction format
ARM: 7351/1: ftrace: remove useless memory checks
ARM: 7316/1: kexec: EOI active and mask all interrupts in kexec crash path
ARM: Versatile Express: add NO_IOPORT
ARM: get rid of asm/irq.h in asm/prom.h
ARM: 7319/1: Print debug info for SIGBUS in user faults
ARM: 7318/1: gic: refactor irq_start assignment
ARM: 7317/1: irq: avoid NULL check in for_each_irq_desc loop
ARM: 7315/1: perf: add support for the Cortex-A7 PMU
...
There is extra state information that needs to be exposed in the
kgdb_bpt structure for tracking how a breakpoint was installed. The
debug_core only uses probe_kernel_write() to install
breakpoints, but this is not enough for all the archs. Some archs such
as x86 need to use text_poke() in order to install a breakpoint into a
read only page.
Passing the kgdb_bpt structure to kgdb_arch_set_breakpoint() and
kgdb_arch_remove_breakpoint() allows other archs to set the type
variable which indicates how the breakpoint was installed.
Cc: stable@vger.kernel.org # >= 2.6.36
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The Smatch tool warned that the change from commit b8adde8dd
(kdb: Avoid using dbg_io_ops until it is initialized) should
add another null check later in the kdb_printf().
It is worth noting that the second use of dbg_io_ops->is_console
is protected by the KDB_PAGER state variable which would only
get set when kdb is fully active and initialized. If we
ever encounter changes or defects in the KDB_PAGER state
we do not want to crash the kernel in a kdb_printf/printk.
CC: Tim Bird <tim.bird@am.sony.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpusets: Remove an unused variable
sched/rt: Improve pick_next_highest_task_rt()
sched: Fix select_fallback_rq() vs cpu_active/cpu_online
sched/x86/smp: Do not enable IRQs over calibrate_delay()
sched: Fix compiler warning about declared inline after use
MAINTAINERS: Update email address for SCHEDULER and PERF EVENTS
Pull x86 updates from Ingo Molnar.
This touches some non-x86 files due to the sanitized INLINE_SPIN_UNLOCK
config usage.
Fixed up trivial conflicts due to just header include changes (removing
headers due to cpu_idle() merge clashing with the <asm/system.h> split).
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/amd: Be more verbose about LVT offset assignments
x86, tls: Off by one limit check
x86/ioapic: Add io_apic_ops driver layer to allow interception
x86/olpc: Add debugfs interface for EC commands
x86: Merge the x86_32 and x86_64 cpu_idle() functions
x86/kconfig: Remove CONFIG_TR=y from the defconfigs
x86: Stop recursive fault in print_context_stack after stack overflow
x86/io_apic: Move and reenable irq only when CONFIG_GENERIC_PENDING_IRQ=y
x86/apic: Add separate apic_id_valid() functions for selected apic drivers
locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage
x86/kconfig: Update defconfigs
x86: Fix excessive MSR print out when show_msr is not specified
Pull timer core updates from Thomas Gleixner.
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ia64: vsyscall: Add missing paranthesis
alarmtimer: Don't call rtc_timer_init() when CONFIG_RTC_CLASS=n
x86: vdso: Put declaration before code
x86-64: Inline vdso clock_gettime helpers
x86-64: Simplify and optimize vdso clock_gettime monotonic variants
kernel-time: fix s/then/than/ spelling errors
time: remove no_sync_cmos_clock
time: Avoid scary backtraces when warning of > 11% adj
alarmtimer: Make sure we initialize the rtctimer
ntp: Fix leap-second hrtimer livelock
x86, tsc: Skip refined tsc calibration on systems with reliable TSC
rtc: Provide flag for rtc devices that don't support UIE
ia64: vsyscall: Use seqcount instead of seqlock
x86: vdso: Use seqcount instead of seqlock
x86: vdso: Remove bogus locking in update_vsyscall_tz()
time: Remove bogus comments
time: Fix change_clocksource locking
time: x86: Fix race switching from vsyscall to non-vsyscall clock
The debugfs code is really generic for all platforms. This patch removes the
powerpc-specific directory reference and makes it available to all
architectures.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
We don't remove the cpu that went offline from our cpumasks
on cpu hotplug. This got lost somewhere along the line, so
restore it. This fixes a hang of the padata instance on cpu
hotplug.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
We use the active cpumask to determine the superset of cpus
to use for parallelization. However, the active cpumask is
for internal usage of the scheduler and therefore not the
appropriate cpumask for these purposes. So use the online
cpumask instead.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a reference to the padata api documentation at Documentation/padata.txt
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Notify get_robust_list users that the syscall is going away.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: kernel-hardening@lists.openwall.com
Cc: spender@grsecurity.net
Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It was possible to extract the robust list head address from a setuid
process if it had used set_robust_list(), allowing an ASLR info leak. This
changes the permission checks to be the same as those used for similar
info that comes out of /proc.
Running a setuid program that uses robust futexes would have had:
cred->euid != pcred->euid
cred->euid == pcred->uid
so the old permissions check would allow it. I'm not aware of any setuid
programs that use robust futexes, so this is just a preventative measure.
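To see why the old check passes for a setuid target, the two comparisons above can be modelled directly. This is a sketch under assumptions: the struct and function are hypothetical, and any capability-based bypass in the real check is left out.

```c
#include <stdbool.h>

/* Hypothetical, reduced credentials: real and effective uid only. */
struct toy_cred {
	unsigned int uid, euid;
};

/*
 * Old-style check as described above: the caller is allowed when its euid
 * matches either the target's euid or the target's uid.
 */
static bool toy_old_check_allows(const struct toy_cred *cred,
				 const struct toy_cred *pcred)
{
	return cred->euid == pcred->euid || cred->euid == pcred->uid;
}
```

With a caller running as uid 1000 and a setuid-root target started by that user (uid 1000, euid 0), the second comparison matches, so the old check lets the caller through.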
(This patch is based on changes from grsecurity.)
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: kernel-hardening@lists.openwall.com
Cc: spender@grsecurity.net
Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We respect node affinity of devices already in the irq descriptor
allocation, but we ignore it for the initial interrupt affinity
setup, so the interrupt might be routed to a different node.
Restrict the default affinity mask to the node on which the irq
descriptor is allocated.
[ tglx: Massaged changelog ]
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1332788538-17425-1-git-send-email-prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The only place irq_finalize_oneshot() is called with force parameter set
is the threaded handler error exit path. But IRQTF_RUNTHREAD is dropped
at this point and irq_wake_thread() is not going to set it again,
since PF_EXITING is set for this thread already. So irq_finalize_oneshot()
will drop the threads bit in threads_oneshot anyway and hence the force
parameter is superfluous.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120321162234.GP24806@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
exit_irq_thread() clears IRQTF_RUNTHREAD flag and drops the thread's bit in
desc->threads_oneshot then. The bit must not be set again in between and it
does not, since irq_wake_thread() sees PF_EXITING flag first and returns.
Due to the above, the order of checking PF_EXITING and IRQTF_RUNTHREAD flags in
irq_wake_thread() is important. This change just makes it more visible in the
source code.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120321162212.GO24806@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If schedule is called from an interrupt handler __schedule_bug()
will call show_regs() with the registers saved during the
interrupt handling done in do_IRQ(). This means we'll see the
registers and the backtrace for the process that was interrupted
and not the full backtrace explaining who called schedule().
This is due to 838225b ("sched: use show_regs() to improve
__schedule_bug() output", 2007-10-24) which improperly assumed
that get_irq_regs() would return the registers for the current
stack because it is being called from within an interrupt
handler. Simply remove the show_regs() code so that we dump a
backtrace for the interrupt handler that called schedule().
[ I ran across this when I was presented with a scheduling while
atomic log with a stacktrace pointing at spin_unlock_irqrestore().
It made no sense and I had to guess what interrupt handler could
be called and poke around for someone calling schedule() in an
interrupt handler. A simple test of putting an msleep() in
an interrupt handler works better with this patch because you
can actually see the msleep() call in the backtrace. ]
Also-reported-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Satyam Sharma <satyam@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1332979847-27102-1-git-send-email-sboyd@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge third batch of patches from Andrew Morton:
- Some MM stragglers
- core SMP library cleanups (on_each_cpu_mask)
- Some IPI optimisations
- kexec
- kdump
- IPMI
- the radix-tree iterator work
- various other misc bits.
"That'll do for -rc1. I still have ~10 patches for 3.4, will send
those along when they've baked a little more."
* emailed from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
backlight: fix typo in tosa_lcd.c
crc32: add help text for the algorithm select option
mm: move hugepage test examples to tools/testing/selftests/vm
mm: move slabinfo.c to tools/vm
mm: move page-types.c from Documentation to tools/vm
selftests/Makefile: make `run_tests' depend on `all'
selftests: launch individual selftests from the main Makefile
radix-tree: use iterators in find_get_pages* functions
radix-tree: rewrite gang lookup using iterator
radix-tree: introduce bit-optimized iterator
fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
nbd: rename the nbd_device variable from lo to nbd
pidns: add reboot_pid_ns() to handle the reboot syscall
sysctl: use bitmap library functions
ipmi: use locks on watchdog timeout set on reboot
ipmi: simplify locking
ipmi: fix message handling during panics
ipmi: use a tasklet for handling received messages
ipmi: increase KCS timeouts
ipmi: decrease the IPMI message transaction time in interrupt mode
...
In the case of a child pid namespace, rebooting the system does not really
make sense. When the pid namespace is used in conjunction with the other
namespaces in order to create a linux container, the reboot syscall leads
to some problems.
A container can reboot the host. That can be fixed by dropping the
CAP_SYS_BOOT capability but we are unable to correctly poweroff/
halt/reboot a container and the container stays stuck at shutdown time
with the container's init process waiting indefinitely.
After several attempts, no solution from userspace was found to reliably
handle the shutdown from a container.
This patch proposes to make the init process of the child pid namespace
exit with a signal status set to: SIGINT if the child pid namespace
called "halt/poweroff" and SIGHUP if the child pid namespace called
"reboot". When the reboot syscall is called and we are not in the initial
pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
"RESTART", and "RESTART2". Otherwise we return EINVAL.
Returning EINVAL is also an easy way to check if this feature is supported
by the kernel when invoking another 'reboot' option like CAD.
This way the parent process of the child pid namespace knows if it
rebooted or not and can take the right decision.
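On the parent side, the decision could be derived from the wait status along these lines. The enum and function names are hypothetical; only the SIGHUP and SIGINT mapping comes from the description above.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical result type for the container manager. */
enum toy_container_action {
	TOY_ACTION_UNKNOWN,
	TOY_ACTION_HALT,
	TOY_ACTION_REBOOT,
};

/* Map the wait status of the namespace's init process to an action. */
static enum toy_container_action toy_decode_init_status(int status)
{
	if (!WIFSIGNALED(status))
		return TOY_ACTION_UNKNOWN;	/* plain exit, not a reboot request */

	switch (WTERMSIG(status)) {
	case SIGHUP:
		return TOY_ACTION_REBOOT;	/* "reboot" was requested */
	case SIGINT:
		return TOY_ACTION_HALT;		/* "halt/poweroff" was requested */
	default:
		return TOY_ACTION_UNKNOWN;
	}
}
```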
Test case:
==========
#define _GNU_SOURCE	/* for clone() and CLONE_NEWPID */
#include <alloca.h>
#include <stdio.h>
#include <sched.h>
#include <unistd.h>
#include <signal.h>
#include <sys/reboot.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/reboot.h>
/* Runs as init of the new pid namespace and issues the reboot command. */
static int do_reboot(void *arg)
{
	int *cmd = arg;
	if (reboot(*cmd))
		printf("failed to reboot(%d): %m\n", *cmd);
	return 0;	/* only reached when reboot() fails */
}
/* Returns 0 if the child was killed by the expected signal, -1 otherwise. */
int test_reboot(int cmd, int sig)
{
	long stack_size = 4096;
	void *stack = alloca(stack_size) + stack_size;
	int status;
	pid_t ret;
	ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
	if (ret < 0) {
		printf("failed to clone: %m\n");
		return -1;
	}
	if (wait(&status) < 0) {
		printf("unexpected wait error: %m\n");
		return -1;
	}
	if (!WIFSIGNALED(status)) {
		printf("child process exited but was not signaled\n");
		return -1;
	}
	if (WTERMSIG(status) != sig) {
		printf("signal termination is not the one expected\n");
		return -1;
	}
	return 0;
}
int main(int argc, char *argv[])
{
	int status;
	status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
	if (status < 0)
		return 1;
	printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");
	status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
	if (status < 0)
		return 1;
	printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");
	status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
	if (status < 0)
		return 1;
	printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");
	status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
	if (status < 0)
		return 1;
	printf("reboot(LINUX_REBOOT_CMD_POWER_OFF) succeed\n");
	/* CAD_ON is not handled in a child pid namespace: expect EINVAL. */
	status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
	if (status >= 0) {
		printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
		return 1;
	}
	printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");
	return 0;
}
[akpm@linux-foundation.org: tweak and add comments]
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use bitmap_set() instead of using set_bit() for each bit. This conversion
is valid because the bitmap is private in the function call and atomic
bitops were unnecessary.
This also includes minor change.
- Use bitmap_copy() for shorter typing
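The difference between setting bits one by one and setting a range can be sketched in userspace. Both helpers below are hypothetical, non-atomic stand-ins (they are not the kernel's set_bit() or bitmap_set()); the range version touches each word once instead of once per bit.

```c
#include <limits.h>

#define TOY_BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Per-bit version: one read-modify-write per bit. */
static void toy_set_bits_one_by_one(unsigned long *map, unsigned int start,
				    unsigned int len)
{
	for (unsigned int i = start; i < start + len; i++)
		map[i / TOY_BITS_PER_LONG] |= 1UL << (i % TOY_BITS_PER_LONG);
}

/* Range version: build one mask per word and OR it in. */
static void toy_bitmap_set(unsigned long *map, unsigned int start,
			   unsigned int len)
{
	while (len) {
		unsigned int off = start % TOY_BITS_PER_LONG;
		unsigned int n = TOY_BITS_PER_LONG - off;
		unsigned long mask;

		if (n > len)
			n = len;
		/* n == TOY_BITS_PER_LONG implies off == 0: take the whole word. */
		mask = (n == TOY_BITS_PER_LONG) ? ~0UL : ((1UL << n) - 1) << off;
		map[start / TOY_BITS_PER_LONG] |= mask;
		start += n;
		len -= n;
	}
}
```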
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using crashkernel=2M-256M, the kernel doesn't give any warning. This
is misleading sometimes.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nommu platforms don't have very interesting swapper_pg_dir pointers and
usually just #define them to NULL, meaning that we can't include them in
the vmcoreinfo on the kexec crash path.
This patch only saves the swapper_pg_dir if we have an MMU.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the on_each_cpu_cond() function that wraps on_each_cpu_mask() and
calculates the cpumask of cpus to IPI by calling a function supplied as a
parameter in order to determine whether to IPI each specific cpu.
The function works around allocation failure of cpumask variable in
CONFIG_CPUMASK_OFFSTACK=y by iterating over cpus, sending one IPI at a time
via smp_call_function_single().
The function is useful since it allows separating the specific code that
decides in each case whether to IPI a specific cpu for a specific request
from the common boilerplate code of handling creating the mask, handling
failures etc.
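A userspace model of the calling pattern, with no real IPIs involved: a predicate is asked about every CPU, the answers are collected in a mask, and the work function runs once per selected CPU. All names and the fixed CPU count are hypothetical; the kernel function's actual signature is not reproduced here.

```c
#include <stdbool.h>

#define TOY_NR_CPUS 8

typedef bool (*toy_cond_fn)(int cpu, void *info);
typedef void (*toy_work_fn)(void *info);

/*
 * Ask cond_func about each CPU, build the mask of "yes" answers, then run
 * func once per selected CPU. Returns the mask that was built.
 */
static unsigned long toy_on_each_cpu_cond(toy_cond_fn cond_func,
					  toy_work_fn func, void *info)
{
	unsigned long mask = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (cond_func(cpu, info))
			mask |= 1UL << cpu;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (mask & (1UL << cpu))
			func(info);	/* stands in for the IPI sent to 'cpu' */

	return mask;
}

/* Example predicate and work function for exercising the helper. */
static bool toy_is_odd(int cpu, void *info)
{
	(void)info;
	return cpu & 1;
}

static void toy_count_call(void *info)
{
	(*(int *)info)++;
}
```

The caller supplies only the predicate and the work function; mask construction and iteration stay in one shared place.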
[akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
[akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
[akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Michal Nazarewicz <mina86@mina86.org>
Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
Cc: Milton Miller <miltonm@bga.com>
Reviewed-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have lots of infrastructure in place to partition multi-core systems
such that we have a group of CPUs that are dedicated to a specific task:
cgroups, scheduler and interrupt affinity, and the isolcpus= boot parameter.
Still, kernel code will at times interrupt all CPUs in the system via IPIs
for various needs. These IPIs are useful and cannot be avoided
altogether, but in certain cases it is possible to interrupt only specific
CPUs that have useful work to do and not the entire system.
This patch set, inspired by discussions with Peter Zijlstra and Frederic
Weisbecker when testing the nohz task patch set, is a first stab at trying
to explore doing this by locating the places where such global IPI calls
are being made and turning the global IPI into an IPI for a specific group
of CPUs. The purpose of the patch set is to get feedback on whether this is
the right way to go for dealing with this issue and indeed, if the issue is
even worth dealing with at all. Based on the feedback from this patch set
I plan to offer further patches that address similar issues in other code
paths.
This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
infrastructure API (the former derived from existing arch specific
versions in Tile and Arm) and uses them to turn several global IPI
invocations into per CPU group invocations.
Core kernel:
on_each_cpu_mask() calls a function on processors specified by cpumask,
which may or may not include the local processor.
You must not call this function with disabled interrupts or from a
hardware interrupt handler or from a bottom half handler.
arch/arm:
Note that the generic version is a little different than the Arm one:
1. It has the mask as first parameter
2. It calls the function on the calling CPU with interrupts disabled,
but this should be OK since the function is called on the other CPUs
with interrupts disabled anyway.
arch/tile:
The API is the same as the tile private one, but the generic version
also calls the function with interrupts disabled in the UP case.
This is OK since the function is called on the other CPUs
with interrupts disabled.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Michal Nazarewicz <mina86@mina86.org>
Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
Cc: Milton Miller <miltonm@bga.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system
Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.
I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.
The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().
This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuitous route involving asm/system.h.
The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).
These patches break asm/system.h up into the following core pieces:
(1) asm/barrier.h
Move memory barriers here. This is already done for MIPS and Alpha.
(2) asm/switch_to.h
Move switch_to() and related stuff here.
(3) asm/exec.h
Move arch_align_stack() here. Other process execution related bits
could perhaps go here from asm/processor.h.
(4) asm/cmpxchg.h
Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().
(5) asm/bug.h
Move die() and related bits.
(6) asm/auxvec.h
Move AT_VECTOR_SIZE_ARCH here.
Other arch headers are created as needed on a per-arch basis."
Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it.
* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
Delete all instances of asm/system.h
Remove all #inclusions of asm/system.h
Add #includes needed to permit the removal of asm/system.h
Move all declarations of free_initmem() to linux/mm.h
Disintegrate asm/system.h for OpenRISC
Split arch_align_stack() out from asm-generic/system.h
Split the switch_to() wrapper out of asm-generic/system.h
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
Create asm-generic/barrier.h
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
Disintegrate asm/system.h for Xtensa
Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
Disintegrate asm/system.h for Tile
Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for SH
Disintegrate asm/system.h for Score
Disintegrate asm/system.h for S390
Disintegrate asm/system.h for PowerPC
Disintegrate asm/system.h for PA-RISC
Disintegrate asm/system.h for MN10300
...
The new API, pm_qos_update_request_timeout() is to provide a timeout
with pm_qos_update_request.
For example, pm_qos_update_request_timeout(req, 100, 1000), means that
QoS request on req with value 100 will be active for 1000 microseconds.
After 1000 microseconds, the QoS request through req is reset. If there
were another pm_qos_update_request(req, x) during the 1000 us, this
new request with value x will override as this is another request on the
same req handle. A new request on the same req handle will always
override the previous request whether it is the conventional request or
it is the new timeout request.
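The override rule can be illustrated with a small userspace model. This is a sketch under assumptions, not the kernel implementation: struct qos_req_sketch, QOS_DEFAULT and the helper names are hypothetical, and time is passed in explicitly instead of using a timer.

```c
#include <stdbool.h>

#define QOS_DEFAULT (-1) /* hypothetical "request has been reset" value */

/* Hypothetical stand-in for a request handle. */
struct qos_req_sketch {
    int value;
    bool has_timeout;
    unsigned long expires_us;
};

/* Conventional update: replaces the value and drops any pending timeout. */
static void qos_update(struct qos_req_sketch *req, int value)
{
    req->value = value;
    req->has_timeout = false;
}

/* Timed update: the value stays active until now_us + timeout_us. */
static void qos_update_timeout(struct qos_req_sketch *req, int value,
                               unsigned long now_us, unsigned long timeout_us)
{
    req->value = value;
    req->has_timeout = true;
    req->expires_us = now_us + timeout_us;
}

/* Value in effect at time now_us; an expired timed request is reset. */
static int qos_effective(const struct qos_req_sketch *req,
                         unsigned long now_us)
{
    if (req->has_timeout && now_us >= req->expires_us)
        return QOS_DEFAULT;
    return req->value;
}
```

In this model a later qos_update() on the same handle simply discards the pending expiry, which mirrors the "newest request on the handle wins" rule described above.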
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mark Gross <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
There is a race condition between the freezer and request_firmware()
such that if request_firmware() is run on one CPU and
freeze_processes() is run on another CPU and usermodehelper_disable()
called by it succeeds in grabbing umhelper_sem for writing before
usermodehelper_read_trylock() called from request_firmware()
acquires it for reading, the request_firmware() will fail and
trigger a WARN_ON() complaining that it was called at a wrong time.
However, in fact, it wasn't called at a wrong time and
freeze_processes() simply happened to be executed simultaneously.
To avoid this race, at least in some cases, modify
usermodehelper_read_trylock() so that it doesn't fail if the
freezing of tasks has just started and hasn't been completed yet.
Instead, during the freezing of tasks, it will try to freeze the
task that has called it so that it can wait until user space is
thawed without triggering the scary warning.
For this purpose, change usermodehelper_disabled so that it can
take three different values, UMH_ENABLED (0), UMH_FREEZING and
UMH_DISABLED. The first one means that usermode helpers are
enabled, the last one means "hard disable" (i.e. the system is not
ready for usermode helpers to be used) and the second one
is reserved for the freezer. Namely, when freeze_processes() is
started, it sets usermodehelper_disabled to UMH_FREEZING which
tells usermodehelper_read_trylock() that it shouldn't fail just
yet and should call try_to_freeze() if woken up and cannot
return immediately. This way all freezable tasks that happen
to call request_firmware() right before freeze_processes() is
started and lose the race for umhelper_sem with it will be
frozen and will sleep until thaw_processes() unsets
usermodehelper_disabled. [For the non-freezable callers of
request_firmware() the race for umhelper_sem against
freeze_processes() is unfortunately unavoidable.]
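The three-state logic can be summarised as a decision table. The sketch below is a simplified userspace model: the state names come from the text above, while enum trylock_action and read_trylock_decision() are hypothetical and the real function also handles waiting and locking.

```c
/* States named in the commit message; UMH_ENABLED is 0. */
enum umh_state_sketch { UMH_ENABLED = 0, UMH_FREEZING, UMH_DISABLED };

/* Hypothetical outcomes for a caller trying to take the read lock. */
enum trylock_action {
    ACT_PROCEED,          /* helpers enabled: return with the lock held */
    ACT_FREEZE_AND_RETRY, /* freezer running: freeze the caller, retry later */
    ACT_FAIL              /* hard disable: fail the request */
};

/* Map the current state to what the caller should do next. */
static enum trylock_action read_trylock_decision(enum umh_state_sketch state)
{
    switch (state) {
    case UMH_ENABLED:
        return ACT_PROCEED;
    case UMH_FREEZING:
        return ACT_FREEZE_AND_RETRY;
    default:
        return ACT_FAIL;
    }
}
```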
Reported-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
The core suspend/hibernation code calls usermodehelper_disable() to
avoid race conditions between the freezer and the starting of
usermode helpers and each code path has to do that on its own.
However, it is always called right before freeze_processes()
and usermodehelper_enable() is always called right after
thaw_processes(). For this reason, to avoid code duplication and
to make the connection between usermodehelper_disable() and the
freezer more visible, make freeze_processes() call it and remove the
direct usermodehelper_disable() and usermodehelper_enable() calls
from all suspend/hibernation code paths.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
There is no reason to call usermodehelper_disable() before creating
memory bitmaps in hibernate() and software_resume(), so call it right
before freeze_processes(), in accordance with the other suspend and
hibernation code. Consequently, call usermodehelper_enable() right
after the thawing of tasks rather than after freeing the memory
bitmaps.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
If firmware is requested asynchronously, by calling
request_firmware_nowait(), there is no reason to fail the request
(and warn the user) when the system is (presumably temporarily)
unready to handle it (because user space is not available yet or
frozen). For this reason, introduce an alternative routine for
read-locking umhelper_sem, usermodehelper_read_lock_wait(), that
will wait for usermodehelper_disabled to be unset (possibly with
a timeout) and make request_firmware_work_func() use it instead of
usermodehelper_read_trylock().
Accordingly, modify request_firmware() so that it uses
usermodehelper_read_trylock() to acquire umhelper_sem and remove
the code related to that lock from _request_firmware().
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Instead of two functions, read_lock_usermodehelper() and
usermodehelper_is_disabled(), used in combination, introduce
usermodehelper_read_trylock() that will only return with umhelper_sem
held if usermodehelper_disabled is unset (and will return -EAGAIN
otherwise) and make _request_firmware() use it.
Rename read_unlock_usermodehelper() to
usermodehelper_read_unlock() to follow the naming convention of the
new function.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:
perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: David Howells <dhowells@redhat.com>
asm/system.h is a cause of circular dependency problems because it contains
commonly used primitive stuff like barrier definitions and uncommonly used
stuff like switch_to() that might require MMU definitions.
asm/system.h has been disintegrated by this point on all arches into the
following common segments:
(1) asm/barrier.h
Moved memory barrier definitions here.
(2) asm/cmpxchg.h
Moved xchg() and cmpxchg() here. #included in asm/atomic.h.
(3) asm/bug.h
Moved die() and similar here.
(4) asm/exec.h
Moved arch_align_stack() here.
(5) asm/elf.h
Moved AT_VECTOR_SIZE_ARCH here.
(6) asm/switch_to.h
Moved switch_to() here.
Signed-off-by: David Howells <dhowells@redhat.com>
We don't use "cpu" any more after 2baab4e904 "sched: Fix
select_fallback_rq() vs cpu_active/cpu_online".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120328104608.GD29022@elgon.mountain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When reading the trace file, the records of each of the per_cpu buffers
are examined to find the next event to print out. At the point of looking
at the event, the size of the event is recorded. But if the first event is
chosen, the other events in the other CPU buffers will reset the event size
that is stored in the iterator descriptor, causing the event size passed to
the output functions to be incorrect.
In most cases this is not a problem, but for the case of stack traces, it
is. With the change to the stack tracing to record a dynamic number of
back traces, the output depends on the size of the entry instead of the
fixed 8 back traces. When the entry size is not correct, the back traces
would not be fully printed.
Note, reading from the per-cpu trace files was not affected.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Avoid extra work by continuing on to the next rt_rq if the highest
prio task in current rt_rq is the same priority as our candidate
task.
More detailed explanation: if next is not NULL, then we have found a
candidate task, and its priority is next->prio. Now we are looking
for an even higher priority task in the other rt_rq's. idx is the
highest priority in the current candidate rt_rq. In the current 3.3
code, if idx is equal to next->prio, we would start scanning the tasks
in that rt_rq and replace the current candidate task with a task from
that rt_rq. But the new task would only have a priority that is equal
to our previous candidate task, so we have not advanced our goal of
finding a higher prio task. So we should avoid the extra work by
continuing on to the next rt_rq if idx is equal to next->prio.
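A stripped-down model of the loop shows the effect of the comparison. This is a sketch only: the structures and pick_highest() are hypothetical, each run queue is reduced to its single best task, and it assumes the kernel convention that a lower prio number means a higher priority.

```c
#include <stddef.h>

/* Hypothetical task: lower prio number = higher priority. */
struct task_sketch { int prio; };

/* Hypothetical run queue reduced to its highest-priority task. */
struct rt_rq_sketch { struct task_sketch *best; };

/*
 * Return the highest-priority task across all queues. *scans counts how
 * many queues were actually examined in detail.
 */
static struct task_sketch *pick_highest(struct rt_rq_sketch *rqs, size_t n,
                                        int *scans)
{
    struct task_sketch *next = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!rqs[i].best)
            continue;

        int idx = rqs[i].best->prio;

        /* Equal priority cannot improve on the candidate: skip it too. */
        if (next && next->prio <= idx)
            continue;

        (*scans)++;
        next = rqs[i].best;
    }
    return next;
}
```

With "<" instead of "<=" the second of two equal-priority queues would be scanned for no gain.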
Signed-off-by: Michael J Wang <mjwang@broadcom.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
supposed to finally sort the cpu_active mess, instead uncovered more.
Since CPU_STARTING is run before setting the cpu online, there's a
(small) window where the cpu has active,!online.
If during this time there's a wakeup of a task that used to reside on
that cpu select_task_rq() will use select_fallback_rq() to compute an
alternative cpu to run on since we find !online.
select_fallback_rq() however will compute the new cpu against
cpu_active, this means that it can return the same cpu it started out
with, the !online one, since that cpu is in fact marked active.
This results in us trying to schedule a task on an offline cpu and
triggering a WARN in the IPI code.
The solution proposed by Chuansheng Liu of setting cpu_active in
set_cpu_online() is buggy, firstly not all archs actually use
set_cpu_online(), secondly, not all archs call set_cpu_online() with
IRQs disabled, this means we would introduce either the same race or
the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
wrong CPU") -- albeit much narrower.
[ By setting online first and active later we have a window of
online,!active, fresh and bound kthreads have task_cpu() of 0 and
since cpu0 isn't in tsk_cpus_allowed() we end up in
select_fallback_rq() which excludes !active, resulting in a reset
of ->cpus_allowed and the thread running all over the place. ]
The solution is to re-work select_fallback_rq() to require active
_and_ online. This makes the active,!online case work as expected,
OTOH archs running CPU_STARTING after setting online are now
vulnerable to the issue from fd8a7de17 -- these are alpha and
blackfin.
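The "active _and_ online" requirement can be pictured with plain bitmasks. This is a userspace sketch: pick_fallback_cpu() is a hypothetical helper, and the real select_fallback_rq() has further fallback stages that are not modelled here.

```c
/*
 * Return the lowest-numbered CPU that is allowed, online and active,
 * or -1 if no such CPU exists. Each bit of a mask stands for one CPU.
 */
static int pick_fallback_cpu(unsigned long allowed, unsigned long online,
                             unsigned long active)
{
    /* A CPU qualifies only if it is set in all three masks. */
    unsigned long ok = allowed & online & active;

    for (int cpu = 0; cpu < (int)(sizeof(ok) * 8); cpu++)
        if (ok & (1UL << cpu))
            return cpu;
    return -1;
}
```

For example, with CPU 1 marked active but not yet online, the helper skips it and moves on to the next allowed CPU instead of returning the offline one.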
Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: linux-alpha@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Module size was limited to 64MB; this was a legacy limitation due to vmalloc()
which was removed a while ago.
Limiting module size to 64MB is both pointless and affects real world use
cases.
Cc: Tim Abbott <tim.abbott@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch adds a set of macros that can be used to declare
kernel parameters to be parsed _before_ initcalls at a chosen
level are executed. We rename the now-unused "flags" field of
struct kernel_param as the level. It's signed, for when we
use this for early params as well, in future.
Linker macro collating init calls had to be modified in order
to add additional symbols between levels that are later used
by the init code to split the calls into blocks.
Signed-off-by: Pawel Moll <pawel.moll@arm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.
This eliminates that code (though leaves the flags field in the struct,
for impending use).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Sometimes we need to test a kernel of same version with code or config
option changes.
We already have a sysctl to disable module loading, but adding a kernel
parameter will be more convenient.
Since modules_disabled is an int, use the bint type in core_param here.
TODO: make sysctl accept bool and change modules_disabled to bool
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
"[PATCH 0/3] RFC - module.h usage cleanups in fs/ and lib/"
https://lkml.org/lkml/2012/2/29/589
--
Fix up files in fs/ and lib/ dirs to only use module.h if they really
need it.
These are trivial in scope vs. the work done previously. We now have
things where any few remaining cleanups can be farmed out to arch or
subsystem maintainers, and I have done so when possible. What is
remaining here represents the bits that don't clearly lie within a
single arch/subsystem boundary, like the fs dir and the lib dir.
Some duplicate includes arising from overlapping fixes from
independent subsystem maintainer submissions are also quashed.
Merge tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker:
"Fix up files in fs/ and lib/ dirs to only use module.h if they really
need it.
These are trivial in scope vs the work done previously. We now have
things where any few remaining cleanups can be farmed out to arch or
subsystem maintainers, and I have done so when possible. What is
remaining here represents the bits that don't clearly lie within a
single arch/subsystem boundary, like the fs dir and the lib dir.
Some duplicate includes arising from overlapping fixes from
independent subsystem maintainer submissions are also quashed."
Fix up trivial conflicts due to clashes with other include file cleanups
(including some due to the previous bug.h cleanup pull).
* tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
lib: reduce the use of module.h wherever possible
fs: reduce the use of module.h wherever possible
includecheck: delete any duplicate instances of module.h
rtc_timer_init() is not available when CONFIG_RTC_CLASS=n. Provide a
proper wrapper in the RTC section of alarmtimer.c
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Pull sysctl updates from Eric Biederman:
- Rewrite of sysctl for speed and clarity.
Insert/remove/Lookup in sysctl are all now O(NlogN) operations, and
are no longer bottlenecks in the process of adding and removing
network devices.
sysctl is now focused on being a filesystem instead of system call
and the code can all be found in fs/proc/proc_sysctl.c. Hopefully
this means the code is now approachable.
Much thanks is owed to Lucian Grinjincu for keeping at this until
something was found that was usable.
- The recent proc_sys_poll oops found by the fuzzer during hibernation
is fixed.
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl: (36 commits)
sysctl: protect poll() in entries that may go away
sysctl: Don't call sysctl_follow_link unless we are a link.
sysctl: Comments to make the code clearer.
sysctl: Correct error return from get_subdir
sysctl: An easier to read version of find_subdir
sysctl: fix memset parameters in setup_sysctl_set()
sysctl: remove an unused variable
sysctl: Add register_sysctl for normal sysctl users
sysctl: Index sysctl directories with rbtrees.
sysctl: Make the header lists per directory.
sysctl: Move sysctl_check_dups into insert_header
sysctl: Modify __register_sysctl_paths to take a set instead of a root and an nsproxy
sysctl: Replace root_list with links between sysctl_table_sets.
sysctl: Add sysctl_print_dir and use it in get_subdir
sysctl: Stop requiring explicit management of sysctl directories
sysctl: Add a root pointer to ctl_table_set
sysctl: Rewrite proc_sys_readdir in terms of first_entry and next_entry
sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry.
sysctl: Normalize the root_table data structure.
sysctl: Factor out insert_header and erase_header
...
As Tetsuo Handa pointed out, request_module() can stress the system
while the oom-killed caller sleeps in TASK_UNINTERRUPTIBLE.
The task T uses "almost all" memory, then it does something which
triggers request_module(). Say, it can simply call sys_socket(). This
in turn needs more memory and leads to OOM. oom-killer correctly
chooses T and kills it, but this can't help because it sleeps in
TASK_UNINTERRUPTIBLE and after that oom-killer becomes "disabled" by the
TIF_MEMDIE task T.
Make __request_module() killable. The only necessary change is that
call_modprobe() should kmalloc argv and module_name, they can't live in
the stack if we use UMH_KILLABLE. This memory is freed via
call_usermodehelper_freeinfo()->cleanup.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes. Move the call_usermodehelper code from
__request_module() into the new simple helper, call_modprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minor cleanup. ____call_usermodehelper() can simply return, no need to
call do_exit() explicitly.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes. It is not sane to use UMH_KILLABLE with enum
umh_wait, but obviously we do not want another argument in
call_usermodehelper_* helpers. Kill this enum, use the plain int.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement UMH_KILLABLE, which should be used along with UMH_WAIT_EXEC/PROC.
The caller must ensure that subprocess_info->path/etc can not go away
until call_usermodehelper_freeinfo().
call_usermodehelper_exec(UMH_KILLABLE) does
wait_for_completion_killable. If it fails, it uses
xchg(&sub_info->complete, NULL) to serialize with umh_complete() which
does the same xchg() to access sub_info->complete.
If call_usermodehelper_exec wins, it can safely return. umh_complete()
should get NULL and call call_usermodehelper_freeinfo().
Otherwise we know that umh_complete() was already called, in this case
call_usermodehelper_exec() falls back to wait_for_completion() which
should succeed "very soon".
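The hand-off through xchg() can be modelled in userspace with C11 atomics. This is a sketch under assumptions: the structures and both helper functions are hypothetical, atomic_exchange() stands in for the kernel's xchg(), and the "freed" flag stands in for calling call_usermodehelper_freeinfo().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a completion. */
struct completion_sketch { int done; };

/* Hypothetical stand-in for subprocess_info. */
struct sub_info_sketch {
    _Atomic(struct completion_sketch *) complete;
    bool freed;
};

/*
 * Helper side. Whoever swaps the pointer out first wins. If the waiter is
 * still there, signal it; otherwise the waiter already left and the helper
 * is responsible for freeing the info. Returns true if it signalled.
 */
static bool umh_complete_sketch(struct sub_info_sketch *si)
{
    struct completion_sketch *c =
        atomic_exchange(&si->complete, (struct completion_sketch *)NULL);

    if (c) {
        c->done = 1;
        return true;
    }
    si->freed = true;
    return false;
}

/*
 * Waiter side, after its killable wait was interrupted. Returns true if it
 * won the race and may return immediately; false means the helper already
 * completed, so the waiter should fall back to a normal wait.
 */
static bool waiter_killed_sketch(struct sub_info_sketch *si)
{
    return atomic_exchange(&si->complete,
                           (struct completion_sketch *)NULL) != NULL;
}
```

Because both sides exchange the same pointer with NULL, exactly one of them observes the non-NULL value, so the info is neither freed twice nor leaked in this model.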
Note: UMH_NO_WAIT == -1 but it obviously should not be used with
UMH_KILLABLE. We delay the necessary cleanup to simplify the back
porting.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Preparation. Add the new trivial helper, umh_complete(). Currently it
simply does complete(sub_info->complete).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change zap_pid_ns_processes() to use SEND_SIG_FORCED; it looks clearer
than SEND_SIG_NOINFO, which relies on the from_ancestor_ns logic in
send_signal().
It is also more efficient if we need to kill a lot of tasks because it
doesn't alloc sigqueue.
While at it, add the __fatal_signal_pending(task) check as a minor
optimization.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cosmetic, rename the from_ancestor_ns argument in prepare_signal()
paths. After the previous change it doesn't match the reality.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
force_sig_info() and friends have special semantics for synchronous
signals; this interface should not be used if the target is not current.
It also needs fixes; in particular, the clearing of SIGNAL_UNKILLABLE
is not exactly right.
However there are callers which have to use force_ exactly because it
clears SIGNAL_UNKILLABLE and thus it can kill the CLONE_NEWPID tasks,
although this is almost always wrong for various reasons.
With this patch SEND_SIG_FORCED ignores SIGNAL_UNKILLABLE, like we do if
the signal comes from the ancestor namespace.
This makes the naming in prepare_signal() paths insane, fixed by the
next cleanup.
Note: this only affects SIGKILL/SIGSTOP, but this is enough for
force_sig() abusers.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PTRACE_SEIZE code is tested and ready for production use, remove the
code which requires special bit in data argument to make PTRACE_SEIZE
work.
Strace team prepares for a new release of strace, and we would like to
ship the code which uses PTRACE_SEIZE, preferably after this change goes
into released kernel.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This can be used to close a few corner cases in strace where we get
unwanted racy behavior after attach, but before we have a chance to set
options (the notorious post-execve SIGTRAP comes to mind), and removes
the need to track "did we set opts for this task" state in strace
internals.
While we are at it:
Make it possible to extend SEIZE in the future with more functionality
by passing non-zero 'addr' parameter. To that end, error out if 'addr'
is non-zero. PTRACE_ATTACH did not (and still does not) have such
check, and users (strace) do pass garbage there... let's avoid
repeating this mistake with SEIZE.
Set all task->ptrace bits in one operation - before this change, we were
adding PT_SEIZED and PT_PTRACE_CAP with task->ptrace |= BIT ops. This
was probably ok (not a bug), but let's be on a safer side.
Changes since v2: use (unsigned long) casts instead of (long) ones, move
PTRACE_SEIZE_DEVEL-related code to separate lines of code.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Pedro Alves <palves@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Exchange PT_TRACESYSGOOD and PT_PTRACE_CAP bit positions, which makes
PT_option bits contiguous and therefore makes code in
ptrace_setoptions() much simpler.
Every PTRACE_O_TRACEevent is defined to (1 << PTRACE_EVENT_event)
instead of using explicit numeric constants, to ensure we don't mess up
relationship between bit positions and event ids.
PT_EVENT_FLAG_SHIFT was not particularly useful, PT_OPT_FLAG_SHIFT with
value of PT_EVENT_FLAG_SHIFT-1 is easier to use.
PT_TRACE_MASK constant is nuked; its only use is replaced by
(PTRACE_O_MASK << PT_OPT_FLAG_SHIFT).
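As an illustration of the scheme described above, here is a minimal
standalone sketch (not the kernel header itself). All DEMO_ names are
invented for the example; the event ids follow the PTRACE_EVENT_*
numbering, and the shift value of 3 is an assumption made for the sketch.

```c
/* Hypothetical names; the event ids mirror the PTRACE_EVENT_* numbering. */
enum demo_event {
	DEMO_EVENT_FORK       = 1,
	DEMO_EVENT_VFORK      = 2,
	DEMO_EVENT_CLONE      = 3,
	DEMO_EVENT_EXEC       = 4,
	DEMO_EVENT_VFORK_DONE = 5,
	DEMO_EVENT_EXIT       = 6,
};

/* Each option bit is derived from its event id, not from a magic number. */
#define DEMO_O_TRACESYSGOOD	1u
#define DEMO_O_TRACE(event)	(1u << (event))
#define DEMO_O_MASK		0x7fu

/* Assumed shift: where the contiguous option bits start in the flags word. */
#define DEMO_OPT_FLAG_SHIFT	3

/* Contiguous bits turn the option-to-flag conversion into mask plus shift. */
unsigned int demo_opts_to_flags(unsigned int opts)
{
	return (opts & DEMO_O_MASK) << DEMO_OPT_FLAG_SHIFT;
}

/* Flag bit that corresponds to a single event id (0 stands for SYSGOOD). */
unsigned int demo_event_flag(unsigned int event)
{
	return 1u << (DEMO_OPT_FLAG_SHIFT + event);
}
```

With this layout, demo_opts_to_flags(DEMO_O_TRACE(DEMO_EVENT_EXIT)) and
demo_event_flag(DEMO_EVENT_EXIT) yield the same bit, so no per-option
if/else chain is needed.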
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On ptrace(PTRACE_SETOPTIONS, pid, 0, <opts>), we used to set those
option bits which are known, and then fail with -EINVAL if there are
some unknown bits in <opts>.
This is inconsistent with typical error handling, which does not change
any state if input is invalid.
This patch changes PTRACE_SETOPTIONS behavior so that in this case, we
return -EINVAL and don't change any bits in task->ptrace.
It's very unlikely that there is userspace code in the wild which will
be affected by this change: it should have the form
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_BOGUSOPT)
where PTRACE_O_BOGUSOPT is a constant unknown to the kernel. But kernel
headers, naturally, don't contain any PTRACE_O_BOGUSOPTs, thus the only
way userspace can use one is if it defines one itself. I can't see why
anyone would do such a thing deliberately.
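The new error handling amounts to "validate first, then commit". A
minimal model of that pattern follows; the MODEL_ names, the mask and the
shift are assumptions for illustration, and this is not the actual
ptrace_setoptions() code.

```c
#include <errno.h>

#define MODEL_O_MASK		0x7fUL	/* option bits this model knows about */
#define MODEL_OPT_FLAG_SHIFT	3	/* assumed position of the option bits */

/*
 * Reject unknown bits before touching *flags, so that a failed call
 * leaves the existing state exactly as it was.
 */
int model_setoptions(unsigned int *flags, unsigned long data)
{
	if (data & ~MODEL_O_MASK)
		return -EINVAL;

	/* Input is valid: drop the old option bits and install the new ones. */
	*flags &= ~(unsigned int)(MODEL_O_MASK << MODEL_OPT_FLAG_SHIFT);
	*flags |= (unsigned int)(data << MODEL_OPT_FLAG_SHIFT);
	return 0;
}
```

A caller that passes a mix of known and unknown bits gets -EINVAL back
and sees none of the known bits applied.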
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It fixes some 80-col wordwrappings and adds some consistency.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the system is loaded while hotplugging a CPU we might end up with a
bogus hardlockup detection. This has been seen during LTP pounder test
executed in parallel with hotplug test.
The main problem is that enable_watchdog (called when CPU is brought up)
registers perf event which periodically checks per-cpu counter
(hrtimer_interrupts), updated from a hrtimer callback, but the hrtimer
is fired from the kernel thread.
This means that while we already do check for the hard lockup the kernel
thread might be sitting on the runqueue with zillions of tasks so there
is nobody to update the value we rely on and so we KABOOM.
Let's fix this by boosting the watchdog thread priority before we wake
it up rather than when it's already running. This still doesn't handle
a case where we have the same amount of high prio FIFO tasks but that
doesn't seem to be common. The current implementation doesn't handle
that case anyway so this is not worse at least.
Unfortunately, we cannot start the perf counter from the watchdog thread
because we could miss a real lockup, and we also cannot start the
hrtimer in watchdog_enable because there is no way (at least I don't
know of any) to start a hrtimer from a different CPU.
[dzickus@redhat.com: fix compile issue with param]
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I just received another user's plea for help when their init
mysteriously died. I again explained that they need to check whether it
died because of a bad instruction, a segv, or something else. That led to
an annoying detour into writing a trivial C program to spawn their init
and print its exit code:
http://lists.busybox.net/pipermail/busybox/2012-January/077172.html
I hear you saying "just test it under /bin/sh". Well, the crashing init
_was_ /bin/sh.
Which prompted me to make the kernel do this first step automatically. We
can print the exit code, which makes it possible to see that death was
from e.g. SIGILL without writing test programs.
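For readers who want to interpret such a value by hand, here is a small
userspace sketch. It assumes the printed exit code uses the same encoding
as a wait() status (e.g. 0x00000004 for death by signal 4, which is
SIGILL on Linux); describe_exit_code() is a hypothetical helper, not
something added by this patch.

```c
#include <stdio.h>
#include <sys/wait.h>

/*
 * Write a human-readable description of a wait-style exit code into buf.
 * Returns the signal number if the task was killed by a signal, else 0.
 */
int describe_exit_code(int code, char *buf, size_t len)
{
	if (WIFSIGNALED(code)) {
		snprintf(buf, len, "killed by signal %d", WTERMSIG(code));
		return WTERMSIG(code);
	}
	if (WIFEXITED(code))
		snprintf(buf, len, "exited with status %d", WEXITSTATUS(code));
	else
		snprintf(buf, len, "unrecognised exit code 0x%08x",
			 (unsigned int)code);
	return 0;
}
```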
[akpm@linux-foundation.org: add 0x to hex number output]
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitly
re-parented to PID 1. The service manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children is
lost at the moment PID 1 cleans up the re-parented processes.
With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.
For a service manager, receiving SIGCHLD and doing wait() is much
preferred over any possible asynchronous notification about specific
PIDs, because the service manager has full access to the child process
data in /proc, and the PID cannot be re-used until the wait(), which the
service manager itself is in charge of, has happened.
As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and
'ps' output:
before:
# ps afx
253 ? Ss 0:00 /bin/dbus-daemon --system --nofork
294 ? Sl 0:00 /usr/libexec/polkit-1/polkitd
328 ? S 0:00 /usr/sbin/modem-manager
608 ? Sl 0:00 /usr/libexec/colord
658 ? Sl 0:00 /usr/libexec/upowerd
819 ? Sl 0:00 /usr/libexec/imsettings-daemon
916 ? Sl 0:00 /usr/libexec/udisks-daemon
917 ? S 0:00 \_ udisks-daemon: not polling any devices
after:
# ps afx
294 ? Ss 0:00 /bin/dbus-daemon --system --nofork
426 ? Sl 0:00 \_ /usr/libexec/polkit-1/polkitd
449 ? S 0:00 \_ /usr/sbin/modem-manager
635 ? Sl 0:00 \_ /usr/libexec/colord
705 ? Sl 0:00 \_ /usr/libexec/upowerd
959 ? Sl 0:00 \_ /usr/libexec/udisks-daemon
960 ? S 0:00 | \_ udisks-daemon: not polling any devices
977 ? Sl 0:00 \_ /usr/libexec/packagekitd
This prctl is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires the
services to live in the same namespace, to be able to talk to each
other.
Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on-demand. Both need init-like capabilities to
be able to properly keep track of the services they start.
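A minimal userspace sketch of how a service manager might use the prctl
is shown here. subreaper_demo() is a hypothetical helper written for this
illustration; the fallback value 36 for PR_SET_CHILD_SUBREAPER is only
there for older headers and is taken from linux/prctl.h.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_CHILD_SUBREAPER
#define PR_SET_CHILD_SUBREAPER 36	/* value from linux/prctl.h */
#endif

/*
 * Double-fork a short-lived "daemon" and check that it is re-parented to
 * us rather than to PID 1.  Returns 0 if we reaped the orphan, -1 if not.
 */
int subreaper_demo(void)
{
	int pipefd[2], status, reaped_orphan = 0;
	pid_t child, orphan = -1, pid;

	if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0)
		return -1;
	if (pipe(pipefd) != 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		/* Intermediate process: fork the daemon, report its PID, exit. */
		pid_t daemon_pid = fork();

		if (daemon_pid == 0)
			_exit(42);	/* the "daemon" */
		if (write(pipefd[1], &daemon_pid, sizeof(daemon_pid)) < 0)
			_exit(1);
		_exit(0);
	}

	close(pipefd[1]);
	if (read(pipefd[0], &orphan, sizeof(orphan)) != (ssize_t)sizeof(orphan))
		orphan = -1;
	close(pipefd[0]);

	/* wait() returns the direct child and, as sub-init, the orphan too. */
	while ((pid = wait(&status)) > 0) {
		if (pid == orphan && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 42)
			reaped_orphan = 1;
	}
	return reaped_orphan ? 0 : -1;
}
```

Without the prctl() call, the orphan would be re-parented to PID 1 and
the final wait() loop would only ever see the intermediate child.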
Many thanks to Oleg for several rounds of review and insights.
[akpm@linux-foundation.org: fix comment layout and spelling]
[akpm@linux-foundation.org: add lengthy code comment from Oleg]
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use than for comparisons, like more than.
CC: John Stultz <john.stultz@linaro.org>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Commit 9863c90f68 (x86, vmware: Remove
deprecated VMI kernel support) removed the only place which set
no_sync_cmos_clock. Since that commit, this variable is never set.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Folks have been getting a number of warnings about time
adjustments > 11%. The WARN_ON leaves a big useless backtrace,
so this patch replaces it with a printk_once().
I'm still working to narrow down the cause of the > 11% adjustment.
Signed-off-by: John Stultz <john.stultz@linaro.org>
jonghwan Choi reported seeing warnings with the alarmtimer
code at suspend/resume time, and pointed out that the
rtctimer isn't being properly initialized.
This patch corrects this issue.
Reported-by: jonghwan Choi <jhbird.choi@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Merge tag 'for_linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull KGDB/KDB updates from Jason Wessel:
"Fixes:
- Fix KDB keyboard repeat scan codes and leaked keyboard events
- Fix kernel crash with kdb_printf() for users who compile new
kdb_printf()'s in early code
- Return all segment registers to gdb on x86_64
Features:
- KDB/KGDB hook the reboot notifier and end user can control if it
stops, detaches or does nothing (updated docs as well)
- Notify users who use CONFIG_DEBUG_RODATA to use hw breakpoints"
* tag 'for_linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kdb: Add message about CONFIG_DEBUG_RODATA on failure to install breakpoint
kdb: Avoid using dbg_io_ops until it is initialized
kgdb,debug_core: add the ability to control the reboot notifier
KDB: Fix usability issues relating to the 'enter' key.
kgdb,debug-core,gdbstub: Hook the reboot notifier for debugger detach
kgdb: Respect that flush op is optional
kgdb: x86: Return all segment registers also in 64-bit mode
Get rid of INLINE_SPIN_UNLOCK entirely, replacing it with
UNINLINE_SPIN_UNLOCK, which has the reverse meaning.
Whoever wants to change the default spinlock inlining
behavior and uninline the spinlocks for some weird reason,
such as spinlock debugging, paravirt etc., can now just
select UNINLINE_SPIN_UNLOCK.
Original discussion at: https://lkml.org/lkml/2012/3/21/357
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: linux-mips@linux-mips.org
Link: http://lkml.kernel.org/r/20120322095502.30866.75756.sendpatchset@codeblue
[ tidied up the changelog a bit ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kernel/sched/fair.c:420: warning: 'account_cfs_rq_runtime' declared inline after being called
kernel/sched/fair.c:420: warning: previous declaration of 'account_cfs_rq_runtime' was here
kernel/sched/fair.c:1165: warning: 'return_cfs_rq_runtime' declared inline after being called
kernel/sched/fair.c:1165: warning: previous declaration of 'return_cfs_rq_runtime' was here
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120321200717.49BB4A024E@akpm.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Complete the syscall-less self-profiling feature and address
all complaints, namely:
- capabilities, so we can detect what is actually available at runtime
Add a capabilities field to perf_event_mmap_page to indicate
what is actually available for use.
- on x86: RDPMC weirdness due to being 40/48 bits and not sign-extending
properly.
- ABI documentation as to how all this stuff works.
Also improve the documentation for the new features.
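To illustrate the sign-extension issue with 40/48-bit counters, here is a
small sketch of one way userspace could widen a raw counter read.
pmc_sign_extend() is a hypothetical helper, and how the counter width is
obtained at runtime is left out as an assumption.

```c
#include <stdint.h>

/*
 * Sign-extend a raw counter value that is only `width` bits wide
 * (e.g. 40 or 48) to a full 64-bit signed value.  Requires
 * 1 <= width <= 64.
 */
int64_t pmc_sign_extend(uint64_t raw, unsigned int width)
{
	unsigned int shift = 64 - width;

	/*
	 * Move the counter's top bit into bit 63, then shift back.  The
	 * right shift of a negative value is arithmetic on gcc and clang,
	 * which copies the top bit into the upper bits.
	 */
	return (int64_t)(raw << shift) >> shift;
}
```

A 40-bit counter that reads as all ones then becomes -1 instead of a
large positive number, so it can simply be added to a running count.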
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1332433596.2487.33.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull input subsystem updates from Dmitry Torokhov:
"- we finally merged driver for USB version of Synaptics touchpads
(I guess most commonly found in IBM/Lenovo keyboard/touchpad combo);
- a bunch of new drivers for embedded platforms (Cypress
touchscreens, DA9052 OnKey, MAX8997-haptic, Ilitek ILI210x
touchscreens, TI touchscreen);
- input core allows clients to specify desired clock source for
timestamps on input events (EVIOCSCLOCKID ioctl);
- input core allows querying state of all MT slots for given event
code via EVIOCGMTSLOTS ioctl;
- various driver fixes and improvements."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (45 commits)
Input: ili210x - add support for Ilitek ILI210x based touchscreens
Input: altera_ps2 - use of_match_ptr()
Input: synaptics_usb - switch to module_usb_driver()
Input: convert I2C drivers to use module_i2c_driver()
Input: convert SPI drivers to use module_spi_driver()
Input: omap4-keypad - move platform_data to <linux/platform_data>
Input: kxtj9 - who_am_i check value and initial data rate fixes
Input: add driver support for MAX8997-haptic
Input: tegra-kbc - revise device tree support
Input: of_keymap - add device tree bindings for simple key matrices
Input: wacom - fix physical size calculation for 3rd-gen Bamboo
Input: twl4030-vibra - really switch from #if to #ifdef
Input: hp680_ts_input - ensure arguments to request_irq and free_irq are compatible
Input: max8925_onkey - avoid accessing input device too early
Input: max8925_onkey - allow to be used as a wakeup source
Input: atmel-wm97xx - convert to dev_pm_ops
Input: atmel-wm97xx - set driver owner
Input: add cyttsp touchscreen maintainer entry
Input: cyttsp - remove useless checks in cyttsp_probe()
Input: usbtouchscreen - add support for Data Modul EasyTouch TP 72037
...
Since commit 7dffa3c673 the ntp
subsystem has used an hrtimer for triggering the leapsecond
adjustment. However, this can cause a potential livelock.
Thomas diagnosed this as the following pattern:
CPU 0                                  CPU 1
do_adjtimex()
  spin_lock_irq(&ntp_lock);
    process_adjtimex_modes();          timer_interrupt()
      process_adj_status();              do_timer()
        ntp_start_leap_timer();            write_lock(&xtime_lock);
          hrtimer_start();                   update_wall_time();
            hrtimer_reprogram();               ntp_tick_length()
              tick_program_event()               spin_lock(&ntp_lock);
                clockevents_program_event()
                  ktime_get()
                    seq = read_seqbegin(xtime_lock);
This patch tries to avoid the problem by reverting back to not using
an hrtimer to inject leapseconds, and instead we handle the leapsecond
processing in the second_overflow() function.
The downside to this change is that on systems that support highres
timers, the leap second processing will occur on a HZ tick boundary,
(ie: ~1-10ms, depending on HZ) after the leap second instead of
possibly sooner (~34us in my tests w/ x86_64 lapic).
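The idea of handling the leap second from the once-per-second overflow
path, rather than from a dedicated timer, can be modelled roughly as
below. This is a simplified illustration with hypothetical names and
states (insertion only), not the kernel's second_overflow() code.

```c
#define SECS_PER_DAY 86400UL

/* Hypothetical states, loosely following the NTP leap-second states. */
enum leap_state {
	LEAP_NONE,		/* no leap second scheduled */
	LEAP_INSERT,		/* insert a second at the end of the day */
	LEAP_IN_PROGRESS,	/* the inserted second is being counted */
	LEAP_DONE,		/* leap second handled */
};

/*
 * Called once per second from the tick path with the new seconds value.
 * Returns the correction (in seconds) to apply to the wall clock.
 */
int leap_second_overflow(enum leap_state *state, unsigned long secs)
{
	int leap = 0;

	switch (*state) {
	case LEAP_INSERT:
		/* Day boundary reached: step back one second (23:59:60). */
		if (secs % SECS_PER_DAY == 0) {
			leap = -1;
			*state = LEAP_IN_PROGRESS;
		}
		break;
	case LEAP_IN_PROGRESS:
		*state = LEAP_DONE;
		break;
	default:
		break;
	}
	return leap;
}
```

Because the check only runs when the seconds counter overflows, no extra
timer has to be programmed while the NTP lock is held.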
This patch applies on top of tip/timers/core.
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Diagnosed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
On x86, if CONFIG_DEBUG_RODATA is set, one cannot set breakpoints
via KDB. Apparently this is a well-known problem, as at least one distribution
now ships with both KDB enabled and CONFIG_DEBUG_RODATA=y for security reasons.
This patch adds a printk message to the breakpoint failure case,
in order to provide suggestions about how to use the debugger.
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Tim Bird <tim.bird@am.sony.com>
This fixes a bug with setting a breakpoint during kdb initialization
(from kdb_cmds). Any call to kdb_printf() before the initialization
of the kgdboc serial console driver (which happens much later during
bootup than kdb_init), results in kernel panic due to the use of
dbg_io_ops before it is initialized.
Signed-off-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Sometimes it is desirable to stop the kernel debugger before allowing
a system to reboot either with kdb or kgdb. This patch adds the
ability to turn the reboot notifier on and off or enter the debugger
and stop kernel execution before rebooting.
It is possible to change the setting after booting the kernel with the
following:
echo 1 > /sys/module/debug_core/parameters/kgdbreboot
It is also possible to change this setting using kdb / kgdb to
manipulate the variable directly.
Using KDB:
mm kgdbreboot 1
Using gdb:
set kgdbreboot=1
Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
This fixes the following problems:
1) Typematic-repeat of 'enter' gives warning message
and leaks make/break if KDB exits. Repeats
look something like 0x1c 0x1c .... 0x9c
2) Use of 'keypad enter' gives warning message and
leaks the ENTER break/make code out if KDB exits.
KP ENTER repeats look something like 0xe0 0x1c
0xe0 0x1c ... 0xe0 0x9c.
3) Lag on the order of seconds between "break" and "make" when
expecting the enter "break" code. Seen under virtualized
environments such as VMware ESX.
The existing special enter handler tries to glob the enter break code,
but this fails if the other (KP) enter was used, or if there was a key
repeat. It also fails if you mashed some keys along with enter, and
you ended up with a non-enter make or non-enter break code coming
after the enter make code. So first, we modify the handler to handle
these cases. But performing these actions on every enter is annoying
since now you can't hold ENTER down to scroll <more>d messages in
KDB. This special behaviour is only necessary to handle
exiting KDB ('g' + ENTER) without leaking scancodes to the OS, so the
cleanup is executed whenever the kdb_main loop exits.
Tested on QEMU. Set a bp on atkbd.c to verify no scan code was leaked.
Cc: Andrei Warkentin <andreiw@vmware.com>
[jason.wessel@windriver.com: move cleanup calls to kdb_main.c]
Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The gdbstub and kdb should get detached if the system is rebooting.
Calling gdbstub_exit() will set the proper debug core state and send a
message to any debugger that is connected to correctly detach.
An attached debugger will receive the exit code from
include/linux/reboot.h based on SYS_HALT, SYS_REBOOT, etc...
Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Not all kgdb I/O drivers implement a flush operation. Adjust
gdbstub_exit accordingly.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Merge first batch of patches from Andrew Morton:
"A few misc things and all the MM queue"
* emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits)
memcg: avoid THP split in task migration
thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
memcg: clean up existing move charge code
mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
mm/memcontrol.c: s/stealed/stolen/
memcg: fix performance of mem_cgroup_begin_update_page_stat()
memcg: remove PCG_FILE_MAPPED
memcg: use new logic for page stat accounting
memcg: remove PCG_MOVE_LOCK flag from page_cgroup
memcg: simplify move_account() check
memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
memcg: kill dead prev_priority stubs
memcg: remove PCG_CACHE page_cgroup flag
memcg: let css_get_next() rely upon rcu_read_lock()
cgroup: revert ss_id_lock to spinlock
idr: make idr_get_next() good for rcu_read_lock()
memcg: remove unnecessary thp check in page stat accounting
memcg: remove redundant returns
memcg: enum lru_list lru
...
Remove lock and unlock around css_get_next()'s call to idr_get_next().
memcg iterators (only users of css_get_next) already did rcu_read_lock(),
and its comment demands that; but add a WARN_ON_ONCE to make sure of it.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c1e2ee2dc4 ("memcg: replace ss->id_lock with a rwlock") has now
been seen to cause the unfair behavior we should have expected from
converting a spinlock to an rwlock: softlockup in cgroup_mkdir(), whose
get_new_cssid() is waiting for the wlock, while there are 19 tasks using
the rlock in css_get_next() to get on with their memcg workload (in an
artificial test, admittedly). Yet lib/idr.c was made suitable for RCU
way back: revert that commit, restoring ss->id_lock to a spinlock.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sync_mm_rss() can only be used for current to avoid race conditions in
iterating and clearing its per-task counters. Remove the task argument
for it and its helper function, __sync_task_rss_stat(), to avoid thinking
it can be used safely for anything other than current.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c0ff7453bb ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
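The read/retry pattern behind a sequence counter can be sketched in
userspace with C11 atomics. This is only a model of the technique: the
struct and function names are hypothetical, the writer is assumed to be
serialised by some outer lock, and the kernel's own seqcount primitives
differ in detail.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for an allowed-node mask and its sequence counter. */
struct mems_seq {
	atomic_uint seq;	/* odd while an update is in progress */
	atomic_ulong nodemask;
};

/* Reader: wait out any in-progress update, remember the sequence number. */
unsigned int mems_read_begin(struct mems_seq *m)
{
	unsigned int seq;

	while ((seq = atomic_load_explicit(&m->seq, memory_order_acquire)) & 1)
		;
	return seq;
}

/* Reader: nonzero if the mask changed meanwhile and the caller must retry. */
int mems_read_retry(struct mems_seq *m, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&m->seq, memory_order_relaxed) != start;
}

/* Writer: bump to odd, publish the new mask, bump back to even. */
void mems_update(struct mems_seq *m, unsigned long new_mask)
{
	atomic_fetch_add_explicit(&m->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&m->nodemask, new_mask, memory_order_relaxed);
	atomic_fetch_add_explicit(&m->seq, 1, memory_order_release);
}

/* Typical reader loop: no stores and no full barrier on the fast path. */
unsigned long mems_read_mask(struct mems_seq *m)
{
	unsigned int seq;
	unsigned long mask;

	do {
		seq = mems_read_begin(m);
		mask = atomic_load_explicit(&m->nodemask, memory_order_relaxed);
	} while (mems_read_retry(m, seq));
	return mask;
}
```

Readers only pay for loads and read-side ordering; the cost of an update
is pushed onto the rare writer and onto readers that happen to overlap
with it.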
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
                                  3.3.0-rc3               3.3.0-rc3
                                rc3-vanilla          nobarrier-v2r1
Clients 1 UserTime           0.07 (  0.00%)          0.08 (-14.19%)
Clients 2 UserTime           0.07 (  0.00%)          0.07 (  2.72%)
Clients 4 UserTime           0.08 (  0.00%)          0.07 (  3.29%)
Clients 1 SysTime            0.70 (  0.00%)          0.65 (  6.65%)
Clients 2 SysTime            0.85 (  0.00%)          0.82 (  3.65%)
Clients 4 SysTime            1.41 (  0.00%)          1.41 (  0.32%)
Clients 1 WallTime           0.77 (  0.00%)          0.74 (  4.19%)
Clients 2 WallTime           0.47 (  0.00%)          0.45 (  3.73%)
Clients 4 WallTime           0.38 (  0.00%)          0.37 (  1.58%)
Clients 1 Flt/sec/cpu   497620.28 (  0.00%)     520294.53 (  4.56%)
Clients 2 Flt/sec/cpu   414639.05 (  0.00%)     429882.01 (  3.68%)
Clients 4 Flt/sec/cpu   257959.16 (  0.00%)     258761.48 (  0.31%)
Clients 1 Flt/sec       495161.39 (  0.00%)     517292.87 (  4.47%)
Clients 2 Flt/sec       820325.95 (  0.00%)     850289.77 (  3.65%)
Clients 4 Flt/sec      1020068.93 (  0.00%)    1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)         135.68    132.17
User+Sys Time Running Test (seconds)     164.2    160.13
Total Elapsed Time (seconds)            123.46    120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Warn about non-zero rss counters at final mmdrop.
This check will prevent recurrences of bugs such as that fixed in "mm:
fix rss count leakage during migration".
I didn't hide this check under CONFIG_VM_DEBUG because it is rather small
and rss counters cover whole page-table management, so this is a good
invariant.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...
Pull security subsystem updates for 3.4 from James Morris:
"The main addition here is the new Yama security module from Kees Cook,
which was discussed at the Linux Security Summit last year. Its
purpose is to collect miscellaneous DAC security enhancements in one
place. This also marks a departure in policy for LSM modules, which
were previously limited to being standalone access control systems.
Chromium OS is using Yama, and I believe there are plans for Ubuntu,
at least.
This patchset also includes maintenance updates for AppArmor, TOMOYO
and others."
Fix trivial conflict in <net/sock.h> due to the jump_label->static_key
rename.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
AppArmor: Fix location of const qualifier on generated string tables
TOMOYO: Return error if fails to delete a domain
AppArmor: add const qualifiers to string arrays
AppArmor: Add ability to load extended policy
TOMOYO: Return appropriate value to poll().
AppArmor: Move path failure information into aa_get_name and rename
AppArmor: Update dfa matching routines.
AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
AppArmor: Add const qualifiers to generated string tables
AppArmor: Fix oops in policy unpack auditing
AppArmor: Fix error returned when a path lookup is disconnected
KEYS: testing wrong bit for KEY_FLAG_REVOKED
TOMOYO: Fix mount flags checking order.
security: fix ima kconfig warning
AppArmor: Fix the error case for chroot relative path name lookup
AppArmor: fix mapping of META_READ to audit and quiet flags
AppArmor: Fix underflow in xindex calculation
AppArmor: Fix dropping of allowed operations that are force audited
AppArmor: Add mising end of structure test to caps unpacking
...
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull irq_domain support for all architectures from Grant Likely:
"Generalize powerpc's irq_host as irq_domain
This branch takes the PowerPC irq_host infrastructure (reverse mapping
from Linux IRQ numbers to hardware irq numbering), generalizes it,
renames it to irq_domain, and makes it available to all architectures.
Originally the plan was to create an all-new irq_domain
implementation which addresses some of the powerpc shortcomings such
as not handling 1:1 mappings well, but doing that proved to be far
more difficult and invasive than generalizing the working code and
refactoring it in-place. So, this branch rips out the 'new'
irq_domain and replaces it with the modified powerpc version (in a
fully bisectable way of course). It converts all users over to the
new API and makes irq_domain selectable on any architecture.
No architecture is forced to enable irq_domain, but the infrastructure
is required for doing OpenFirmware style irq translations. It will
even work on SPARC even though SPARC has its own mechanism for
translating irqs at boot time. MIPS, microblaze, embedded x86 and c6x
are converted too.
The resulting irq_domain code is probably still too verbose and can be
optimized more, but that can be done incrementally and is a task for
follow-on patches."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: (31 commits)
dt: fix twl4030 for non-dt compile on x86
mfd: twl-core: Add IRQ_DOMAIN dependency
devicetree: Add empty of_platform_populate() for !CONFIG_OF_ADDRESS (sparc)
irq_domain: Centralize definition of irq_dispose_mapping()
irq_domain/mips: Allow irq_domain on MIPS
irq_domain/x86: Convert x86 (embedded) to use common irq_domain
ppc-6xx: fix build failure in flipper-pic.c and hlwd-pic.c
irq_domain/microblaze: Convert microblaze to use irq_domains
irq_domain/powerpc: Replace custom xlate functions with library functions
irq_domain/powerpc: constify irq_domain_ops
irq_domain/c6x: Use library of xlate functions
irq_domain/c6x: constify irq_domain structures
irq_domain/c6x: Convert c6x to use generic irq_domain support.
irq_domain: constify irq_domain_ops
irq_domain: Create common xlate functions that device drivers can use
irq_domain: Remove irq_domain_add_simple()
irq_domain: Remove 'new' irq_domain in favour of the ppc one
mfd: twl-core.c: Fix the number of interrupts managed by twl4030
of/address: add empty static inlines for !CONFIG_OF
irq_domain: Add support for base irq and hwirq in legacy mappings
...
Assorted extensions and fixes including:
* Introduction of early/late suspend/hibernation device callbacks.
* Generic PM domains extensions and fixes.
* devfreq updates from Axel Lin and MyungJoo Ham.
* Device PM QoS updates.
* Fixes of concurrency problems with wakeup sources.
* System suspend and hibernation fixes.
Merge tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates for 3.4 from Rafael Wysocki:
"Assorted extensions and fixes including:
* Introduction of early/late suspend/hibernation device callbacks.
* Generic PM domains extensions and fixes.
* devfreq updates from Axel Lin and MyungJoo Ham.
* Device PM QoS updates.
* Fixes of concurrency problems with wakeup sources.
* System suspend and hibernation fixes."
* tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits)
PM / Domains: Check domain status during hibernation restore of devices
PM / devfreq: add relation of recommended frequency.
PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on()
PM / shmobile: Make CMT driver use pm_genpd_dev_always_on()
PM / shmobile: Make TMU driver use pm_genpd_dev_always_on()
PM / Domains: Introduce "always on" device flag
PM / Domains: Fix hibernation restore of devices, v2
PM / Domains: Fix handling of wakeup devices during system resume
sh_mmcif / PM: Use PM QoS latency constraint
tmio_mmc / PM: Use PM QoS latency constraint
PM / QoS: Make it possible to expose PM QoS latency constraints
PM / Sleep: JBD and JBD2 missing set_freezable()
PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case
PM / Freezer: Remove references to TIF_FREEZE in comments
PM / Sleep: Add more wakeup source initialization routines
PM / Hibernate: Enable usermodehelpers in hibernate() error path
PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
PM / Sleep: Fix race conditions related to wakeup source timer function
PM / Sleep: Fix possible infinite loop during wakeup source destruction
PM / Hibernate: print physical addresses consistently with other parts of kernel
...
Pull kmap_atomic cleanup from Cong Wang.
It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().
Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.
* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
drbd: remove the second argument of k[un]map_atomic()
zcache: remove the second argument of k[un]map_atomic()
gma500: remove the second argument of k[un]map_atomic()
dm: remove the second argument of k[un]map_atomic()
tomoyo: remove the second argument of k[un]map_atomic()
sunrpc: remove the second argument of k[un]map_atomic()
rds: remove the second argument of k[un]map_atomic()
net: remove the second argument of k[un]map_atomic()
mm: remove the second argument of k[un]map_atomic()
lib: remove the second argument of k[un]map_atomic()
power: remove the second argument of k[un]map_atomic()
kdb: remove the second argument of k[un]map_atomic()
udf: remove the second argument of k[un]map_atomic()
ubifs: remove the second argument of k[un]map_atomic()
squashfs: remove the second argument of k[un]map_atomic()
reiserfs: remove the second argument of k[un]map_atomic()
ocfs2: remove the second argument of k[un]map_atomic()
ntfs: remove the second argument of k[un]map_atomic()
...
Pull trivial tree from Jiri Kosina:
"It's indeed trivial -- mostly documentation updates and a bunch of
typo fixes from Masanari.
There are also several linux/version.h include removals from Jesper."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
kcore: fix spelling in read_kcore() comment
constify struct pci_dev * in obvious cases
Revert "char: Fix typo in viotape.c"
init: fix wording error in mm_init comment
usb: gadget: Kconfig: fix typo for 'different'
Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
writeback: fix typo in the writeback_control comment
Documentation: Fix multiple typo in Documentation
tpm_tis: fix tis_lock with respect to RCU
Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
Doc: Update numastat.txt
qla4xxx: Add missing spaces to error messages
compiler.h: Fix typo
security: struct security_operations kerneldoc fix
Documentation: broken URL in libata.tmpl
Documentation: broken URL in filesystems.tmpl
mtd: simplify return logic in do_map_probe()
mm: fix comment typo of truncate_inode_pages_range
power: bq27x00: Fix typos in comment
...
Pull workqueue changes from Tejun Heo:
"This contains only one commit which cleans up the UP allocation path a
bit."
* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: use percpu allocator for cwq on UP
Pull cgroup changes from Tejun Heo:
"Out of the 8 commits, one fixes a long-standing locking issue around
tasklist walking and others are cleanups."
* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
cgroup: remove cgroup_subsys argument from callbacks
cgroup: remove extra calls to find_existing_css_set
cgroup: replace tasklist_lock with rcu_read_lock
cgroup: simplify double-check locking in cgroup_attach_proc
cgroup: move struct cgroup_pidlist out from the header file
cgroup: remove cgroup_attach_task_current_cg()
exit_notify() changes ->exit_signal if the parent already did exec.
This doesn't really work: we are not going to send the signal now
if there is another live thread or the exiting task is traced. The
parent can exec before the last thread dies or the tracer detaches.
Move this check into do_notify_parent() which actually sends the
signal.
The user-visible change is that we do not change ->exit_signal,
and thus the exiting task is still "clone children" for
do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the
current logic is racy anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
exit_notify() checks "tsk->self_exec_id != tsk->parent_exec_id"
to handle the "we have changed execution domain" case.
We can change de_thread() to always set ->exit_signal = SIGCHLD
and remove this check to simplify the code.
We could change setup_new_exec() instead, this looks more logical
because it increments ->self_exec_id. But note that de_thread()
already resets ->exit_signal if it changes the leader, let's keep
both changes close to each other.
Note that we change ->exit_signal lockless, this changes the rules.
Thereafter ->exit_signal is not stable under tasklist but this is
fine, the only possible change is OLDSIG -> SIGCHLD. This can race
with eligible_child() but the race is harmless. We can race with
reparent_leader() which changes our ->exit_signal in parallel, but
it does the same change to SIGCHLD.
The noticeable user-visible change is that the execing task is not
"visible" to do_wait()->eligible_child(__WCLONE) right after exec.
To me this looks more logical, and this is consistent with mt case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The child must not control its ->exit_signal, it is the parent who
decides which signal the child should use for notification.
This means that CLONE_PARENT should not use "clone_flags & CSIGNAL",
the forking task is the sibling of the new process and their parent
doesn't control exit_signal in this case.
This patch uses ->exit_signal of the forking process, but perhaps
we should simply use SIGCHLD.
We read group_leader->exit_signal lockless, this can race with the
ORIGINAL_SIGNAL -> SIGCHLD transition, but this is fine.
Potentially this change allows us to kill self_exec_id/parent_exec_id.
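The selection rule described above can be sketched in user-space C. This is only an illustration: the DEMO_* constants and demo_pick_exit_signal() are hypothetical names made up for the sketch, not kernel definitions, and the signal numbers are illustrative values.

```c
#include <assert.h>

/* Hypothetical values for this sketch only. The low byte of the flags
 * carries the requested signal number, as the CSIGNAL mask implies. */
#define DEMO_CSIGNAL		0x000000ffUL
#define DEMO_CLONE_PARENT	0x00008000UL
#define DEMO_SIGUSR1		10	/* illustrative signal numbers */
#define DEMO_SIGCHLD		17

/*
 * Pick the exit signal for a new process.
 * With the CLONE_PARENT-style flag the new process is a sibling of the
 * forking task, so the caller's requested signal is ignored and the
 * forking task's (group leader's) exit signal is inherited instead.
 */
static int demo_pick_exit_signal(unsigned long clone_flags,
				 int leader_exit_signal)
{
	assert(leader_exit_signal >= 0);

	if (clone_flags & DEMO_CLONE_PARENT)
		return leader_exit_signal;

	return (int)(clone_flags & DEMO_CSIGNAL);
}
```

A plain fork-style call keeps the signal it asked for, while a sibling created with the parent flag gets whatever the forking task already uses.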
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timer changes for v3.4 from Ingo Molnar
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
ntp: Fix integer overflow when setting time
math: Introduce div64_long
cs5535-clockevt: Allow the MFGPT IRQ to be shared
cs5535-clockevt: Don't ignore MFGPT on SMP-capable kernels
x86/time: Eliminate unused irq0_irqs counter
clocksource: scx200_hrt: Fix the build
x86/tsc: Reduce the TSC sync check time for core-siblings
timer: Fix bad idle check on irq entry
nohz: Remove ts->inidle checks before restarting the tick
nohz: Remove update_ts_time_stat from tick_nohz_start_idle
clockevents: Leave the broadcast device in shutdown mode when not needed
clocksource: Load the ACPI PM clocksource asynchronously
clocksource: scx200_hrt: Convert scx200 to use clocksource_register_hz
clocksource: Get rid of clocksource_calc_mult_shift()
clocksource: dbx500: convert to clocksource_register_hz()
clocksource: scx200_hrt: use pr_<level> instead of printk
time: Move common updates to a function
time: Reorder so the hot data is together
time: Remove most of xtime_lock usage in timekeeping.c
ntp: Add ntp_lock to replace xtime_locking
...
Pull scheduler changes for v3.4 from Ingo Molnar
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
printk: Make it compile with !CONFIG_PRINTK
sched/x86: Fix overflow in cyc2ns_offset
sched: Fix nohz load accounting -- again!
sched: Update yield() docs
printk/sched: Introduce special printk_sched() for those awkward moments
sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
sched: Cleanup cpu_active madness
sched: Fix load-balance wreckage
sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
sched: Ditch per cgroup task lists for load-balancing
sched: Rename load-balancing fields
sched: Move load-balancing arguments into helper struct
sched/rt: Do not submit new work when PI-blocked
sched/rt: Prevent idle task boosting
sched/wait: Add __wake_up_all_locked() API
sched/rt: Document scheduler related skip-resched-check sites
sched/rt: Use schedule_preempt_disabled()
sched/rt: Add schedule_preempt_disabled()
sched/rt: Do not throttle when PI boosting
sched/rt: Keep period timer ticking when rt throttling is active
...
Pull perf events changes for v3.4 from Ingo Molnar:
- New "hardware based branch profiling" feature both on the kernel and
the tooling side, on CPUs that support it. (modern x86 Intel CPUs
with the 'LBR' hardware feature currently.)
This new feature is basically a sophisticated 'magnifying glass' for
branch execution - something that is pretty difficult to extract from
regular, function histogram centric profiles.
The simplest mode is activated via 'perf record -b', and the result
looks like this in perf report:
$ perf record -b any_call,u -e cycles:u branchy
$ perf report -b --sort=symbol
52.34% [.] main [.] f1
24.04% [.] f1 [.] f3
23.60% [.] f1 [.] f2
0.01% [k] _IO_new_file_xsputn [k] _IO_file_overflow
0.01% [k] _IO_vfprintf_internal [k] _IO_new_file_xsputn
0.01% [k] _IO_vfprintf_internal [k] strchrnul
0.01% [k] __printf [k] _IO_vfprintf_internal
0.01% [k] main [k] __printf
This output shows from/to branch columns and shows the highest
percentage (from,to) jump combinations - i.e. the most likely taken
branches in the system. "branches" can also include function calls
and any other synchronous and asynchronous transitions of the
instruction pointer that are not 'next instruction' - such as system
calls, traps, interrupts, etc.
This feature comes with (hopefully intuitive) flat ascii and TUI
support in perf report.
- Various 'perf annotate' visual improvements for us assembly junkies.
It will now recognize function calls in the TUI and by hitting enter
you can follow the call (recursively) and back, amongst other
improvements.
- Multiple threads/processes recording support in perf record, perf
stat, perf top - which is activated via a comma-list of PIDs:
perf top -p 21483,21485
perf stat -p 21483,21485 -ddd
perf record -p 21483,21485
- Support for per UID views, via the --uid parameter to perf top, perf
report, etc. For example 'perf top --uid mingo' will only show the
tasks that I am running, excluding other users, root, etc.
- Jump label restructurings and improvements - this includes the
factoring out of the (hopefully much clearer) include/linux/static_key.h
generic facility:
struct static_key key = STATIC_KEY_INIT_FALSE;
...
if (static_key_false(&key))
do unlikely code
else
do likely code
...
static_key_slow_inc(&key);
...
static_key_slow_dec(&key);
...
The static_key_false() branch will be generated into the code with as
little impact to the likely code path as possible. The
static_key_slow_*() APIs flip the branch via live kernel code patching.
This facility can now be used more widely within the kernel to
micro-optimize hot branches whose likelihood matches the static-key
usage and fast/slow cost patterns.
- SW function tracer improvements: perf support and filtering support.
- Various hardenings of the perf.data ABI, to make older perf.data's
smoother on newer tool versions, to make new features integrate more
smoothly, to support cross-endian recording/analyzing workflows
better, etc.
- Restructuring of the kprobes code, the splitting out of 'optprobes',
and a corner case bugfix.
- Allow the tracing of kernel console output (printk).
- Improvements/fixes to user-space RDPMC support, allowing user-space
self-profiling code to extract PMU counts without performing any
system calls, while playing nice with the kernel side.
- 'perf bench' improvements
- ... and lots of internal restructurings, cleanups and fixes that made
these features possible. And, as usual this list is incomplete as
there were also lots of other improvements.
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
perf report: Fix annotate double quit issue in branch view mode
perf report: Remove duplicate annotate choice in branch view mode
perf/x86: Prettify pmu config literals
perf report: Enable TUI in branch view mode
perf report: Auto-detect branch stack sampling mode
perf record: Add HEADER_BRANCH_STACK tag
perf record: Provide default branch stack sampling mode option
perf tools: Make perf able to read files from older ABIs
perf tools: Fix ABI compatibility bug in print_event_desc()
perf tools: Enable reading of perf.data files from different ABI rev
perf: Add ABI reference sizes
perf report: Add support for taken branch sampling
perf record: Add support for sampling taken branch
perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
x86/kprobes: Split out optprobe related code to kprobes-opt.c
x86/kprobes: Fix a bug which can modify kernel code permanently
x86/kprobes: Fix instruction recovery on optimized path
perf: Add callback to flush branch_stack on context switch
perf: Disable PERF_SAMPLE_BRANCH_* when not supported
perf/x86: Add LBR software filter support for Intel CPUs
...
Pull irq/core changes for v3.4 from Ingo Molnar
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Remove paranoid warnons and bogus fixups
genirq: Flush the irq thread on synchronization
genirq: Get rid of unnecessary IRQTF_DIED flag
genirq: No need to check IRQTF_DIED before stopping a thread handler
genirq: Get rid of unnecessary irqaction field in task_struct
genirq: Fix incorrect check for forced IRQ thread handler
softirq: Reduce invoke_softirq() code duplication
genirq: Fix long-term regression in genirq irq_set_irq_type() handling
x86-32/irq: Don't switch to irq stack for a user-mode irq
Pull RCU changes for v3.4 from Ingo Molnar. The major features of this
series are:
- making RCU more aggressive about entering dyntick-idle mode in order
to improve energy efficiency
- converting a few more call_rcu()s to kfree_rcu()s
- applying a number of rcutree fixes and cleanups to rcutiny
- removing CONFIG_SMP #ifdefs from treercu
- allowing RCU CPU stall times to be set via sysfs
- adding CPU-stall capability to rcutorture
- adding more RCU-abuse diagnostics
- updating documentation
- fixing yet more issues located by the still-ongoing top-to-bottom
inspection of RCU, this time with a special focus on the CPU-hotplug
code path.
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rcu: Stop spurious warnings from synchronize_sched_expedited
rcu: Hold off RCU_FAST_NO_HZ after timer posted
rcu: Eliminate softirq-mediated RCU_FAST_NO_HZ idle-entry loop
rcu: Add RCU_NONIDLE() for idle-loop RCU read-side critical sections
rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit()
rcu: Remove redundant check for rcu_head misalignment
PTR_ERR should be called before its argument is cleared.
rcu: Convert WARN_ON_ONCE() in rcu_lock_acquire() to lockdep
rcu: Trace only after NULL-pointer check
rcu: Call out dangers of expedited RCU primitives
rcu: Rework detection of use of RCU by offline CPUs
lockdep: Add CPU-idle/offline warning to lockdep-RCU splat
rcu: No interrupt disabling for rcu_prepare_for_idle()
rcu: Move synchronize_sched_expedited() to rcutree.c
rcu: Check for illegal use of RCU from offlined CPUs
rcu: Update stall-warning documentation
rcu: Add CPU-stall capability to rcutorture
rcu: Make documentation give more realistic rcutorture duration
rcutorture: Permit holding off CPU-hotplug operations during boot
rcu: Print scheduling-clock information on RCU CPU stall-warning messages
...
Pull core/locking changes for v3.4 from Ingo Molnar
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Simplify return logic
futex: Cover all PI opcodes with cmpxchg enabled check
Alexander pointed out that the warnons in the regular exit path are
bogus and the thread_mask one actually could be triggered when
__setup_irq() hands out that thread_mask again after __free_irq()
dropped irq_desc->lock.
Thinking more about it, neither IRQTF_RUNTHREAD nor the bit in
thread_mask can be set as this is the regular exit path. We come here
due to:
__free_irq()
remove action from desc
synchronize_irq()
kthread_stop()
So synchronize_irq() makes sure that the thread finished running and
cleaned up both the thread_active count and thread_mask. After that
point nothing can set IRQTF_RUNTHREAD on this action. So the warnons
and the cleanups are pointless.
Reported-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Ido Yariv <ido@wizery.com>
Link: http://lkml.kernel.org/r/20120315190755.GA6732@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no global irq lock which makes a syscall magically SMP
safe. Remove the outdated comment concerning do_settimeofday() as
well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
change_clocksource() fails to grab locks or call timekeeping_update(),
which leaves a race window for time inconsistencies.
This adds proper locking and a call to timekeeping_update() to fix this.
CC: Andy Lutomirski <luto@amacapital.net>
CC: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
CAP_SYS_ADMIN is already overloaded left and right, so to have more
fine-grained access control use CAP_SYS_RESOURCE here.
CAP_SYS_RESOURCE was chosen because this prctl option allows the current
process to adjust some fields of the memory map descriptor, which
represent what the process owns: pointers to code, data, stack
segments, command line, auxiliary vector data, etc.
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'long secs' is passed as the divisor to div_s64(), which accepts a 32-bit
divisor. On 64-bit machines that value is truncated from 8 bytes
to 4, causing a divide by zero when the number is bigger than
(1 << 32) - 1 and all 32 lower bits are 0.
Use div64_long() instead.
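The truncation can be reproduced outside the kernel with a minimal sketch. The demo_* helpers below are hypothetical stand-ins, not the kernel's div_s64() or div64_long(); they only model what a 32-bit divisor parameter sees versus a full-width one.

```c
#include <assert.h>
#include <stdint.h>

/* What a 32-bit divisor parameter receives: only the low 32 bits. */
static uint32_t demo_low32(int64_t divisor)
{
	return (uint32_t)divisor;
}

/* True when a non-zero 64-bit divisor becomes 0 after truncation. */
static int demo_divisor_collapses_to_zero(int64_t divisor)
{
	return divisor != 0 && demo_low32(divisor) == 0;
}

/* Full-width division, i.e. the divisor is never narrowed. */
static int64_t demo_div_full(int64_t dividend, int64_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}
```

A divisor of exactly 1 << 32 is non-zero as a 64-bit value but reads as zero through the narrow parameter, which is the case the commit describes.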
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: johnstul@us.ibm.com
Link: http://lkml.kernel.org/r/1331829374-31543-2-git-send-email-levinsasha928@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit 3ccf3e8306 ("printk/sched: Introduce special
printk_sched() for those awkward moments") overlooked
an #ifdef, so move code around to respect these directives.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/1331811337.18960.179.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pull block fixes from Jens Axboe:
"Been sitting on this for a while, but let's get this out the door.
This fixes various important bugs for 3.3 final, along with a few more
trivial ones. Please pull!"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix ioc leak in put_io_context
block, sx8: fix pointer math issue getting fw version
Block: use a freezable workqueue for disk-event polling
drivers/block/DAC960: fix -Wuninitialized warning
drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
block: fix __blkdev_get and add_disk race condition
block: Fix setting bio flags in drivers (sd_dif/floppy)
block: Fix NULL pointer dereference in sd_revalidate_disk
block: exit_io_context() should call elevator_exit_icq_fn()
block: simplify ioc_release_fn()
block: replace icq->changed with icq->flags
The current implementation does not always flush the threaded handler
when disabling the irq. In case the irq handler was called, but the
threaded handler hasn't started running yet, the interrupt will be
flagged as pending, and the handler will not run. This implementation
has some issues:
First, if the interrupt is a wake source and flagged as pending, the
system will not be able to suspend.
Second, when quickly disabling and re-enabling the irq, the threaded
handler might continue to run after the irq is re-enabled without the
irq handler being called first. This might be an unexpected behavior.
In addition, it might be counter-intuitive that the threaded handler
will not be called even though the irq handler was called and returned
IRQ_WAKE_THREAD.
Fix this by always waiting for the threaded handler to complete in
synchronize_irq().
[ tglx: Massaged comments, added WARN_ONs and the missing
IRQTF_RUNTHREAD check in exit_irq_thread() ]
Signed-off-by: Ido Yariv <ido@wizery.com>
Link: http://lkml.kernel.org/r/1322843052-7166-1-git-send-email-ido@wizery.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When padata_do_parallel() is called from multiple cpus for the same
padata instance, we can get object reordering on sequence number wrap
because testing for sequence number wrap and resetting the sequence
number must happen atomically but is implemented with two atomic
operations. This patch fixes this by converting the sequence number
from atomic_t to an unsigned int and protecting the access with a
spin_lock. As a side effect, we get rid of the sequence number wrap
handling because the sequence number now wraps back to zero without
the need to do anything.
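A user-space sketch of the same idea, assuming a C11 atomic_flag as a stand-in for the kernel spin_lock. The struct and function names are hypothetical and do not come from padata.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical counter: a plain unsigned int guarded by a tiny spinlock. */
struct demo_seq {
	atomic_flag lock;
	unsigned int seq_nr;
};

static void demo_seq_init(struct demo_seq *s, unsigned int start)
{
	assert(s);
	atomic_flag_clear(&s->lock);
	s->seq_nr = start;
}

/* Hand out the next sequence number; read and increment are one step. */
static unsigned int demo_seq_next(struct demo_seq *s)
{
	unsigned int nr;

	while (atomic_flag_test_and_set(&s->lock))
		;	/* spin until the lock is free */
	nr = s->seq_nr++;	/* unsigned overflow wraps to 0 by definition */
	atomic_flag_clear(&s->lock);

	return nr;
}
```

Because the read and the increment happen under one lock, there is no window between "test for wrap" and "reset", and no explicit wrap handling is left to get wrong.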
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When a padata object is queued to the serialization queue, another
cpu might process and free the padata object. So don't dereference
it after queueing to the serialization queue.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Uprobes uses exception notifiers to get to know if a thread hit
a breakpoint or a singlestep exception.
When a thread hits a uprobe or is singlestepping post a uprobe
hit, the uprobe exception notifier sets its TIF_UPROBE bit,
which will then be checked on its return to userspace path
(do_notify_resume() -> uprobe_notify_resume()), where the
consumers' handlers are run (in task context) based on the
defined filters.
Uprobe hits are thread specific and hence we need to maintain
information about whether a task hit a uprobe, which uprobe was hit,
and the slot where the original instruction was copied for xol so
that it can be singlestepped with appropriate fixups.
In some cases, special care is needed for instructions that are
executed out of line (xol). These are architecture specific
artefacts, such as handling RIP relative instructions on x86_64.
Since the instruction at which the uprobe was inserted is
executed out of line, architecture specific fixups are added so
that the thread continues normal execution in the presence of a
uprobe.
Postpone the signals until we execute the probed insn.
The post_xol() path does a recalc_sigpending() before returning to
user mode, which ensures the signal can't be lost.
Uprobes relies on the DIE_DEBUG notification to learn whether a
singlestep is complete.
Adds x86 specific uprobe exception notifiers and appropriate
hooks needed to determine a uprobe hit and subsequent post
processing.
Add requisite x86 fixups for xol for uprobes. Specific cases
needing fixups include relative jumps (x86_64), calls, etc.
Where possible, we check and skip singlestepping the
breakpointed instructions. For now we skip single-byte as well
as a few multibyte nop instructions. However this can be extended
to other instructions too.
Credits to Oleg Nesterov for suggestions/patches related to
signal, breakpoint, singlestep handling code.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com
[ Performed various cleanliness edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Today's -next fails to build for me:
CC kernel/trace/trace_export.o
In file included from kernel/trace/trace_export.c:197:
kernel/trace/trace_entries.h:58: error: 'perf_ftrace_event_register' undeclared here (not in a function)
make[2]: *** [kernel/trace/trace_export.o] Error 1
make[1]: *** [kernel/trace] Error 2
make: *** [kernel] Error 2
because as of ced390 (ftrace, perf: Add support to use function
tracepoint in perf) perf_ftrace_event_register() is declared in trace.h
only if CONFIG_PERF_EVENTS is enabled but I don't have that set.
Ensure that we always have a definition of perf_ftrace_event_register()
by making the definition unconditional.
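One common way to keep such a name usable in every configuration is a fallback definition for the disabled case. The sketch below uses hypothetical names (DEMO_CONFIG_PERF_EVENTS, demo_perf_register, demo_event_class) and is not the actual trace.h change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event class with an optional registration callback. */
struct demo_event_class {
	const char *name;
	int (*reg)(int type);
};

#ifdef DEMO_CONFIG_PERF_EVENTS
/* Feature built in: a real callback exists. */
static int demo_perf_register(int type)
{
	return type >= 0 ? 0 : -1;
}
#else
/* Feature compiled out: the name still resolves, to "no callback". */
#define demo_perf_register NULL
#endif

/* This initializer compiles whether or not the feature is enabled. */
static struct demo_event_class demo_class = {
	.name	= "function",
	.reg	= demo_perf_register,
};

/* Callers check for a missing callback instead of relying on #ifdefs. */
static int demo_register(struct demo_event_class *c, int type)
{
	assert(c);
	if (!c->reg)
		return -1;
	return c->reg(type);
}
```

With the option unset, users of the name still build and simply see a NULL callback at run time.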
Link: http://lkml.kernel.org/r/1330426967-17067-1-git-send-email-broonie@opensource.wolfsonmicro.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When CONFIG_DYNAMIC_FTRACE is not set, some archs (ARM) test
the variable ftrace_trace_function to determine if it should
call the function tracer. If it is not set to ftrace_stub, then
it will call the function and return, and not call the function
graph tracer.
But some of these archs (ARM) do not have the assembly code
to test if function tracing is enabled or not (quick stop of tracing)
and it calls the helper routine ftrace_test_stop_func() instead.
If function tracer is enabled and then disabled, the variable
ftrace_trace_function is still set to the helper routine
ftrace_test_stop_func(), and not to ftrace_stub. This will
prevent the function graph tracer from ever running.
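The failure mode can be modelled in a few lines of user-space C. All demo_* names are hypothetical; the sketch only assumes what the text above states, namely that the arch code compares the published pointer against the stub, and it shows one way to keep that comparison meaningful.

```c
#include <assert.h>

typedef void (*demo_trace_fn)(unsigned long ip);

/* Stand-in for the "do nothing" stub the arch code compares against. */
static void demo_stub(unsigned long ip) { (void)ip; }

/* Example tracer callback. */
static void demo_function_tracer(unsigned long ip) { (void)ip; }

static demo_trace_fn demo_real_fn = demo_stub;		/* actual tracer */
static demo_trace_fn demo_published_fn = demo_stub;	/* what arch code tests */

/* Helper used where no assembly-level "tracing stopped" check exists. */
static void demo_test_stop_helper(unsigned long ip)
{
	demo_real_fn(ip);
}

/*
 * Publish the helper only while a real tracer is installed. When the
 * tracer is switched back to the stub, publish the stub itself so the
 * "is a function tracer set?" comparison becomes false again.
 */
static void demo_set_tracer(demo_trace_fn fn)
{
	assert(fn);
	demo_real_fn = fn;
	demo_published_fn = (fn == demo_stub) ? fn : demo_test_stop_helper;
}

/* The arch-side test: the graph tracer runs only if the stub is published. */
static int demo_graph_tracer_can_run(void)
{
	return demo_published_fn == demo_stub;
}
```

If demo_set_tracer() always published the helper, the check would stay false after the function tracer was disabled, which matches the empty trace shown in the "before" output.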
Output before patch
/debug/tracing # echo function > current_tracer
/debug/tracing # echo function_graph > current_tracer
/debug/tracing # cat trace
Output after patch
/debug/tracing # echo function > current_tracer
/debug/tracing # echo function_graph > current_tracer
/debug/tracing # cat trace
0) ! 253.375 us | } /* irq_enter */
0) | generic_handle_irq() {
0) | handle_fasteoi_irq() {
0) 9.208 us | _raw_spin_lock();
0) | handle_irq_event() {
0) | handle_irq_event_percpu() {
Signed-off-by: Rajesh Bhagat <rajesh.lnx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This callback is called by the scheduler after rq->lock has been released
and interrupts enabled. It will be used in subsequent patches on the ARM
architecture.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Tested-by: Marc Zyngier <Marc.Zyngier@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/20120313110840.7b444deb6b1bb902c15f3cdf@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>
bkpt doesn't seem to be a correct abbreviation for breakpoint.
Choice was between bp and breakpoint. Since bp can refer to
things other than breakpoint, use swbp to refer to breakpoints.
This is pure cleanup, no functional change intended.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120312092545.5379.91251.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If a function takes struct uprobe or struct arch_uprobe, then it
is passed as the first parameter.
This is pure cleanup, no functional change intended.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120312092530.5379.18394.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Various people reported nohz load tracking still being wrecked, but Doug
spotted the actual problem. We fold the nohz remainder in too soon,
causing us to lose samples and under-account.
So instead of playing catch-up up-front, always do a single load-fold
with whatever state we encounter and only then fold the nohz remainder
and play catch-up.
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Lesław Kopeć <leslaw.kopec@nasza-klasa.pl>
Reported-by: Aman Gupta <aman@tmm1.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-4v31etnhgg9kwd6ocgx3rxl8@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are a few awkward printk()s inside of scheduler guts that people
prefer to keep but really are rather deadlock prone. Fudge around it
by storing the text in a per-cpu buffer and poll it using the existing
printk_tick() handler.
This will drop output when it's more frequent than once a tick; however,
only the affinity thing could possibly go that fast, and for that just
one should suffice to notify the admin he's done something silly.
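A rough user-space model of the idea, with hypothetical names
(deferred_print(), tick_flush()) and a single buffer standing in for the
per-cpu one. The deadlock-prone caller only copies text; the tick
handler does the actual output. In this sketch a second message within
the same tick overwrites the first, which is one way the "drop" can
happen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SCHED_BUF_SIZE 128

/* One buffer per CPU in the design above; one is enough for a model. */
static char sched_buf[SCHED_BUF_SIZE];
static bool sched_pending;

/* Called from the deadlock-prone context: stores text, never prints. */
static void deferred_print(const char *msg)
{
	snprintf(sched_buf, sizeof(sched_buf), "%s", msg);
	sched_pending = true;
}

/* Called once per tick from a safe context. Copies the pending message
 * to 'out' and returns 1, or returns 0 when nothing is pending. */
static int tick_flush(char *out, size_t outlen)
{
	if (!sched_pending)
		return 0;
	sched_pending = false;
	snprintf(out, outlen, "%s", sched_buf);
	return 1;
}
```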
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-wua3lmkt3dg8nfts66o6brne@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The 'next_balance' field of 'nohz' idle balancer must be initialized
to jiffies. Since jiffies is initialized to negative 300 seconds the
'nohz' idle balancer does not run for the first 300s (5mins) after
bootup. If no new processes are spawned or no idle cycles happen, the
load on the cpus will remain unbalanced for that duration.
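The arithmetic can be checked with a small sketch. Assumptions, none of
which come from the text above: a 32-bit jiffies counter, HZ = 100, and
a next_balance that previously started at zero. The helper names are
hypothetical. With a wrap-safe comparison, a deadline of 0 lies 300
seconds in the future when the counter starts at -300*HZ.

```c
#include <stdint.h>

#define HZ 100 /* assumed tick rate */
/* Counter starts 300 seconds before it wraps to zero. */
#define INITIAL_JIFFIES ((uint32_t)(-300 * HZ))

/* Wrap-safe "a is at or after b" on a 32-bit counter. */
static int time_after_eq32(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* The balancer runs only once 'now' has reached 'next_balance'. */
static int balance_due(uint32_t now, uint32_t next_balance)
{
	return time_after_eq32(now, next_balance);
}
```

With next_balance set to the current jiffies value instead of zero, the
first check is due immediately.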
Signed-off-by: Diwakar Tundlam <dtundlam@nvidia.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1DD7BFEDD3147247B1355BEFEFE4665237994F30EF@HQMAIL04.nvidia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Stepan found:
  CPU0                                  CPUn

  _cpu_up()
    __cpu_up()
                                        boostrap()
                                          notify_cpu_starting()
                                          set_cpu_online()
                                          while (!cpu_active())
                                            cpu_relax()
  <PREEMPT-out>
  smp_call_function(.wait=1)
    /* we find cpu_online() is true */
    arch_send_call_function_ipi_mask()
    /* wait-forever-more */
                                        <PREEMPT-in>
                                          local_irq_enable()
    cpu_notify(CPU_ONLINE)
      sched_cpu_active()
        set_cpu_active()
Now the purpose of cpu_active is mostly with bringing down a cpu, where
we mark it !active to avoid the load-balancer from moving tasks to it
while we tear down the cpu. This is required because we only update the
sched_domain tree after we brought the cpu-down. And this is needed so
that some tasks can still run while we bring it down, we just don't want
new tasks to appear.
On cpu-up however the sched_domain tree doesn't yet include the new cpu,
so it's invisible to the load-balancer, regardless of the active state.
So instead of setting the active state after we boot the new cpu (and
consequently having to wait for it before enabling interrupts) set the
cpu active before we set it online and avoid the whole mess.
Reported-by: Stepan Moskovchenko <stepanm@codeaurora.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 367456c ("sched: Ditch per cgroup task lists for
load-balancing") completely wrecked load-balancing due to
a few silly mistakes.
Correct those and remove more pointless code.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-zk04ihygwxn7qqrlpaf73b0r@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I noticed that commit bbddff makes the percpu allocator work on UP,
so we no longer need the magic way for UP.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently the IRQTF_DIED flag is set when an IRQ thread handler calls
do_exit(). But the PF_EXITING per-process flag also gets set when a thread
exits. This fix eliminates the duplicate by using the PF_EXITING flag.
Also, there is a race condition in exit_irq_thread(). If a thread's
bit is cleared in desc->threads_oneshot (and the IRQ line gets unmasked)
but a new interrupt comes in before the IRQTF_DIED flag is set, it might
set the just-cleared bit again, this time forever. This fix throws the
IRQTF_DIED flag away, eliminating the race as a result.
[ tglx: Test THREAD_EXITING first as suggested by Oleg ]
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120309135958.GD2114@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since 63706172f3 kthread_stop() is not
afraid of dead kernel threads. So no need to check if a thread is
alive before stopping it. These checks still were racy.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120309135939.GC2114@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When a new thread handler is created, an irqaction is passed to it as
data. Not only is that irqaction stored in task_struct by the handler
for later use, but a structure associated with the kernel thread
also keeps this value as long as the thread exists.
This fix kicks irqaction out of task_struct. Yes, I introduce a new bit
field. But it not only eliminates the duplicate, it also
shrinks task_struct.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120309135925.GB2114@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This reverts commit 8f2f748b06.
It causes some odd regression that we have not figured out, and it's too
late in the -rc series to try to figure it out now.
As reported by Konstantin Khlebnikov, it causes consistent hangs on his
laptop (Thinkpad x220: 2x cores + HT). They can be avoided by adding
calls to "rebuild_sched_domains();" in cpuset_cpu_[in]active() for the
CPU_{ONLINE/DOWN_FAILED/DOWN_PREPARE}_FROZEN cases, but it's not at all
clear why, and it makes no sense.
Konstantin's config doesn't even have CONFIG_CPUSETS enabled, just to
make things even more interesting. So it's not the cpusets, it's just
the scheduling domains.
So until this is understood, revert.
Bisected-reported-and-tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ac5637611 (genirq: Unmask oneshot irqs when thread was not woken)
fails to unmask when a !IRQ_ONESHOT threaded handler is handled by
handle_level_irq.
This happens because thread_mask is or'ed unconditionally in
irq_wake_thread(), but for !IRQ_ONESHOT interrupts never cleared. So
the check for !desc->thread_active fails and keeps the interrupt
disabled.
Keep the thread_mask zero for !IRQ_ONESHOT interrupts.
Document the thread_mask magic while at it.
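A simplified model of the bookkeeping, using made-up structure and
function names rather than the real genirq ones. It assumes the level
flow handler masks the line, ORs the action's thread_mask into
threads_oneshot when waking the thread, and unmasks right away only if
no bits are set. Keeping thread_mask at zero for non-oneshot actions
then lets the line be unmasked immediately.

```c
#include <stdbool.h>

struct irq_desc_model {
	unsigned long threads_oneshot; /* threads still to finish */
	bool masked;
};

struct irq_action_model {
	bool oneshot;
	unsigned long thread_mask;
};

/* The fix in miniature: only oneshot actions get a nonzero mask. */
static void setup_action(struct irq_action_model *a, bool oneshot,
			 unsigned int bit)
{
	a->oneshot = oneshot;
	a->thread_mask = oneshot ? 1UL << bit : 0;
}

static void handle_level_irq_model(struct irq_desc_model *d,
				   const struct irq_action_model *a)
{
	d->masked = true;                     /* level flow masks first */
	d->threads_oneshot |= a->thread_mask; /* done when waking thread */
	if (!d->threads_oneshot)
		d->masked = false;            /* nothing to wait for */
}

/* A oneshot thread clears its bit and unmasks when it is the last. */
static void thread_finished(struct irq_desc_model *d,
			    const struct irq_action_model *a)
{
	d->threads_oneshot &= ~a->thread_mask;
	if (!d->threads_oneshot)
		d->masked = false;
}
```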
Reported-and-tested-by: Sven Joachim <svenjoac@gmx.de>
Reported-and-tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ioc_task_link() is used to share %current's ioc on clone. If
%current->io_context is set, %current is guaranteed to have refcount
on the ioc and, thus, ioc_task_link() can't fail.
Replace error checking in ioc_task_link() with WARN_ON_ONCE() and make
it just increment refcount and nr_tasks.
-v2: Description typo fix (Vivek).
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The two invoke_softirq() variants are identical except for a single
line. So move the #ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED inside one of
the functions and get rid of the other one.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In 2008, commit 0c5d1eb77a ("genirq: record trigger type") modified the
way set_irq_type() handles the 'no trigger' condition. However, this has
an adverse effect on PCMCIA support on Intel StrongARM and probably PXA
platforms.
PCMCIA has several status signals on the socket which can trigger
interrupts; some of these status signals depend on the card's mode
(whether it is configured in memory or IO mode). For example, cards have
a 'Ready/IRQ' signal: in memory mode, this provides an indication to
PCMCIA that the card has finished its power up initialization. In IO
mode, it provides the device interrupt signal. Other status signals
switch between on-board battery status and loud speaker output.
In classical PCMCIA implementations, where you have a specific socket
controller, the controller provides a method to mask interrupts from the
socket, and importantly ignore any state transitions on the pins which
correspond with interrupts once masked. This masking prevents unwanted
events caused by the removal and application of socket power being
forwarded.
However, on platforms where there is no socket controller, the PCMCIA
status and interrupt signals are routed to standard edge-triggered GPIOs.
These GPIOs can be configured to interrupt on rising edge, falling edge,
or never. This is where the problems start.
Edge triggered interrupts are required to record events while disabled via
the usual methods of {free,request,disable,enable}_irq() to prevent
problems with dropped interrupts (eg, the 8390 driver uses disable_irq()
to defer the delivery of interrupts). As a result, these interfaces can
not be used to implement the desired behaviour.
The side effect of this is that if the 'Ready/IRQ' GPIO is disabled via
disable_irq() on suspend, and enabled via enable_irq() after resume, we
will record the state transitions caused by powering events as valid
interrupts, and forward them to the card driver, which may attempt to
access a card which is not powered up.
This delays resume while drivers spin in their interrupt handlers,
and provokes complaints from drivers before they realize what's happened.
Moreover, in the case of the 'Ready/IRQ' signal, this is requested and
freed by the card driver itself; the PCMCIA core has no idea whether the
interrupt is requested, and, therefore, whether a call to disable_irq()
would be valid. (We tried this around 2.4.17 / 2.5.1 kernel era, and
ended up throwing it out because of this problem.)
Therefore, it was decided back in around 2002 to disable the edge
triggering instead, resulting in all state transitions on the GPIO being
ignored. That's what we actually need the hardware to do.
The commit above changes this behaviour; it explicitly prevents the 'no
trigger' state being selected.
The reason that request_irq() does not accept the 'no trigger' state is
for compatibility with existing drivers which do not provide their desired
triggering configuration. The set_irq_type() function is 'new' and not
used by non-trigger aware drivers.
Therefore, revert this change, and restore previously working platforms
back to their former state.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: linux@arm.linux.org.uk
Cc: Ingo Molnar <mingo@elte.hu>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
"softlockup: check all tasks in hung_task" commit ce9dbe24 looks
absolutely wrong.
- rcu_lock_break() does put_task_struct(). If the task has exited
it is not safe to even read its ->state, nothing protects this
task_struct.
- The TASK_DEAD checks are wrong too. Contrary to the comment, we
can't use it to check if the task was unhashed. It can be unhashed
without TASK_DEAD, or it can be valid with TASK_DEAD.
For example, an autoreaping task can do release_task(current)
long before it sets TASK_DEAD in do_exit().
Or, a zombie task can have ->state == TASK_DEAD but release_task()
was not called, and in this case we must not break the loop.
Change this code to check pid_alive() instead, and do this before we drop
the reference to the task_struct.
Note: while_each_thread() under rcu_read_lock() is not really safe, it can
livelock. This will be fixed later, but fortunately in this case the
"max_count" logic saves us anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mandeep Singh Baines <msb@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously it was (ab)used by utrace. Then it was wrongly used by the
scheduler code.
Currently it is not used, kill it before it finds the new erroneous user.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done(). zap_threads() should find and kill all tasks with
the same ->mm, this includes our parent if ->vfork_done is set.
mm_release() becomes the only caller, unexport complete_vfork_done().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make vfork() killable.
Change do_fork(CLONE_VFORK) to do wait_for_completion_killable(). If it
fails we do not return to the user-mode and never touch the memory shared
with our child.
However, in this case we should clear child->vfork_done before return, we
use task_lock() in do_fork()->wait_for_vfork_done() and
complete_vfork_done() to serialize with each other.
Note: now that we use task_lock() we don't really need completion, we
could turn task->vfork_done into "task_struct *wake_up_me" but this needs
some complications.
NOTE: this and the next patches do not affect in-kernel users of
CLONE_VFORK, kernel threads run with all signals ignored including
SIGKILL/SIGSTOP.
However, this is obviously a user-visible change. Not only can a fatal
signal kill the vforking parent, but a sub-thread can also do execve or
exit_group() and kill the thread sleeping in vfork().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes.
Move the clear-and-complete-vfork_done code into the new trivial helper,
complete_vfork_done().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
register_kprobe() aborts if the address of the new request falls in a
prohibited area (such as ftrace pouch, __kprobes annotated functions,
non-kernel text addresses, jump label text). We however don't return the
right error on this abort, resulting in a silent failure - incorrect
adding/reporting of kprobes ('perf probe do_fork+18' or 'perf probe
mcount' for instance).
In V2 we are incorporating Masami Hiramatsu's feedback.
This patch fixes it by returning -EINVAL upon failure.
While we are here, rename the label used for exit to be more appropriate.
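The shape of the fix can be sketched in plain C. Everything here is
illustrative: the address ranges are invented, and register_probe() and
in_prohibited_area() are hypothetical stand-ins, not the kprobes API.
What matters is that the early-exit path sets an error code instead of
falling through with a success value.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct addr_range {
	uintptr_t start; /* inclusive */
	uintptr_t end;   /* exclusive */
};

/* Invented "prohibited" ranges, for illustration only. */
static const struct addr_range prohibited[] = {
	{ 0x1000, 0x2000 },
	{ 0x8000, 0x8100 },
};

static int in_prohibited_area(uintptr_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(prohibited) / sizeof(prohibited[0]); i++)
		if (addr >= prohibited[i].start && addr < prohibited[i].end)
			return 1;
	return 0;
}

static int register_probe(uintptr_t addr)
{
	int ret = 0;

	if (in_prohibited_area(addr)) {
		ret = -EINVAL; /* report the abort to the caller */
		goto out;
	}
	/* ... arming the probe would happen here ... */
out:
	return ret;
}
```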
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Prashanth K Nageshappa <prashanth@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 04c6862c05 ("kmsg_dump: add kmsg_dump() calls to the
reboot, halt, poweroff and emergency_restart paths"), kmsg_dump() gets
run on normal paths including poweroff and reboot.
This is less than ideal given pstore implementations that can only
represent single backtraces, since a reboot may overwrite a stored oops
before it's been picked up by userspace. In addition, some pstore
backends may have low performance and provide a significant delay in
reboot as a result.
This patch adds a printk.always_kmsg_dump kernel parameter (which can also
be changed from userspace). Without it, the code will only be run on
failure paths rather than on normal paths. The option can be enabled in
environments where there's a desire to attempt to audit whether or not a
reboot was cleanly requested.
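A sketch of the gating logic, assuming the dump reasons are ordered so
that failure paths sort before normal shutdown paths. The enum values
and should_dump() are hypothetical names for this model; only the
always_kmsg_dump knob is taken from the description above.

```c
#include <stdbool.h>

/* Failure reasons first, normal shutdown paths after (assumed order). */
enum dump_reason {
	DUMP_PANIC,
	DUMP_OOPS,
	DUMP_RESTART,
	DUMP_HALT,
	DUMP_POWEROFF,
};

/* Models printk.always_kmsg_dump; off unless the admin enables it. */
static bool always_kmsg_dump;

static bool should_dump(enum dump_reason reason)
{
	/* Normal paths are skipped unless explicitly requested. */
	if (reason > DUMP_OOPS && !always_kmsg_dump)
		return false;
	return true;
}
```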
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With branch stack sampling, it is possible to filter by priv levels.
In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.
We need a callback on context switch to either flush the branch stack
or save it. This patch adds a new callback in struct pmu which is called
during context switches. The callback is called only when necessary.
That is when a system-wide context has, at least, one event which
uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for
per-thread context.
In this version, the Intel x86 code simply flushes (resets) the LBR
on context switches (fills it with zeroes). Those zeroed branches are
then filtered out by the SW filter.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the ability to sample taken branches to the
perf_event interface.
The ability to capture taken branches is very useful for all
sorts of analysis. For instance, basic block profiling, call
counts, statistical call graph.
This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel x86 it is
implemented on top of the Last Branch Record (LBR) facility.
To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.
Sampled taken branches may be filtered by type and/or priv
levels.
The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.
Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.
The following generic filters are currently defined:
 - PERF_SAMPLE_USER
   only branches whose targets are at the user level
 - PERF_SAMPLE_KERNEL
   only branches whose targets are at the kernel level
 - PERF_SAMPLE_HV
   only branches whose targets are at the hypervisor level
 - PERF_SAMPLE_ANY
   any type of branches (subject to priv levels filters)
 - PERF_SAMPLE_ANY_CALL
   any call branches (may incl. syscall on some arch)
 - PERF_SAMPLE_ANY_RET
   any return branches (may incl. syscall returns on some arch)
 - PERF_SAMPLE_IND_CALL
   indirect call branches
Obviously filters may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event is used. It
is possible to collect branches at a priv level different from the
associated event. Use of kernel, hv priv levels is subject to permissions
and availability (hv).
The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, the executed code. Therefore
each sample contains the number of taken branches it contains.
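For illustration, a user-space sketch of filling in the attribute for
branch sampling. It assumes a Linux system whose <linux/perf_event.h>
carries the feature; in that header the filter bits are spelled with a
PERF_SAMPLE_BRANCH_ prefix (for example PERF_SAMPLE_BRANCH_USER), which
is longer than the shorthand used in the list above. The helper name,
the chosen event and the sample period are arbitrary choices for this
sketch, and no perf_event_open() call is made here.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Hypothetical helper: sample user-level call branches on cycles. */
static void init_branch_sampling(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 100000; /* arbitrary */

	/* Ask for the taken-branch stack in each sample ... */
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	/* ... filtered by priv level and by branch type. */
	attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
				   PERF_SAMPLE_BRANCH_ANY_CALL;
}
```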
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch removes all the references in the code to the TIF_FREEZE
flag removed by commit a3201227f8 ("freezer: make freezing() test freeze
conditions in effect instead of TIF_FREEZE").
There still are some references to TIF_FREEZE in
Documentation/power/freezing-of-tasks.txt, but it looks like that
documentation needs more thorough work to reflect how the new
freezer works, and hence merely removing the references to TIF_FREEZE
won't really help. So I have not touched that part in this patch.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If create_basic_memory_bitmaps() fails, usermodehelpers are not re-enabled
before returning. Fix this. And while at it, reword the goto labels so that
they look more meaningful.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Pulling latest branches from Ingo:
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
memblock: Fix size aligning of memblock_alloc_base_nid()
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf probe: Ensure offset provided is not greater than function length without DWARF info too
perf tools: Ensure comm string is properly terminated
perf probe: Ensure offset provided is not greater than function length
perf evlist: Return first evsel for non-sample event on old kernel
perf/hwbp: Fix a possible memory leak
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume
Pass nice as a value to proc_sched_autogroup_set_nice().
No side effect is expected, and the variable err will be overwritten with
the return value.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F45FBB7.5090607@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch (as1519) fixes a bug in the block layer's disk-events
polling. The polling is done by a work routine queued on the
system_nrt_wq workqueue. Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.
Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.
The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: <stable@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As ftrace_dump() (called by ftrace_dump_on_oops) disables interrupts
as it dumps its output to the console, it can keep interrupts disabled
for long periods of time. This is likely to trigger the NMI watchdog,
and it can disrupt the output of critical data.
Add a touch_nmi_watchdog() to each event that is written to the screen
to keep the NMI watchdog from affecting the output.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Per cgroup load-balance has numerous problems, chief amongst them that
there is no real sane order in them. So stop pretending it makes sense
and enqueue all tasks on a single list.
This also allows us to more easily fix the fwd progress issue
uncovered by the lock-break stuff. Rotate the list on failure to
migrate and limit the total iterations to nr_running (which with
releasing the lock isn't strictly accurate but close enough).
Also add a filter that skips very light tasks on the first attempt
around the list, this attempts to avoid shooting whole cgroups around
without affecting overall balance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Link: http://lkml.kernel.org/n/tip-tx8yqydc7eimgq7i4rkc3a4g@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Passing large sets of similar arguments all around the load-balancer
gets tiresome when you want to modify something. Stick them all in a
helper structure and pass the structure around.
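The refactoring pattern, sketched with invented names (struct
balance_env, move_one()) and invented fields; the real structure and its
members may differ. Bundling the arguments means a new parameter becomes
one new field instead of a change to every function signature on the
call path.

```c
/* Hypothetical bundle of what the balancing helpers pass around. */
struct balance_env {
	int src_cpu;
	int dst_cpu;
	int idle;
	unsigned long load_budget; /* load still allowed to move */
};

/* Try to move one task of the given load within the budget.
 * Returns the load moved, or 0 if the task does not fit. */
static unsigned long move_one(struct balance_env *env,
			      unsigned long task_load)
{
	if (task_load > env->load_budget)
		return 0;
	env->load_budget -= task_load;
	return task_load;
}
```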
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Link: http://lkml.kernel.org/n/tip-5slqz0vhsdzewrfk9eza1aon@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we are PI-blocked then we want to get things done ASAP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-vw8et3445km5b8mpihf4trae@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Idle task boosting is a no-no in general. There is one
exception, when PREEMPT_RT and NOHZ are active:
The idle task calls get_next_timer_interrupt() and holds
the timer wheel base->lock on the CPU and another CPU wants
to access the timer (probably to cancel it). We can safely
ignore the boosting request, as the idle CPU runs this code
with interrupts disabled and will complete the lock
protected section without being interrupted. So there is no
real need to boost.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-755rvsosz7sdzot12a3gbha6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For code which protects the waitqueue itself with another lock it
makes no sense to acquire the waitqueue lock for wakeup all. Provide
__wake_up_all_locked().
This is an optimization on the vanilla kernel (to be used by the
PCI code) and an important semantic distinction on -rt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-ux6m4b8jonb9inx8xafh77ds@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Create a distinction between scheduler related preempt_enable_no_resched()
calls and the nearly one hundred other places in the kernel that do not
want to reschedule, for one reason or another.
This distinction matters for -rt, where the scheduler and the non-scheduler
preempt models (and checks) are different. For upstream it's purely
documentational.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-gs88fvx2mdv5psnzxnv575ke@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add helper to get rid of the ever repeating:
preempt_enable_no_resched();
schedule();
preempt_disable();
patterns.
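A toy model of such a helper. The primitives below are stubs that only
track a counter, and the helper name schedule_preempt_disabled() is an
assumption of this sketch. The caller enters and leaves with preemption
disabled, while the scheduler itself runs with the count dropped.

```c
/* Stubbed primitives: a counter stands in for the preempt state. */
static int preempt_count = 1; /* caller holds preemption disabled */
static int schedule_calls;
static int count_seen_in_schedule = -1;

static void preempt_disable(void)           { preempt_count++; }
static void preempt_enable_no_resched(void) { preempt_count--; }

static void schedule(void)
{
	schedule_calls++;
	count_seen_in_schedule = preempt_count;
}

/* The helper: same three steps as the open-coded pattern. */
static void schedule_preempt_disabled(void)
{
	preempt_enable_no_resched();
	schedule();
	preempt_disable();
}
```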
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-wxx7btox7coby6ifv5vzhzgp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a runqueue has rt_runtime_us = 0 then the only way it can
accumulate rt_time is via PI boosting. That causes the runqueue
to be throttled and replenishing does not change anything due to
rt_runtime_us = 0. So avoid that situation by clearing rt_time and
skipping the throttling altogether.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[ Changelog ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-7x70cypsotjb4jvcor3edctk@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a runqueue is throttled we cannot disable the period timer
because that timer is the only way to undo the throttling.
We got stale throttling entries when a rq was throttled and then the
global sysctl was disabled, which stopped the timer.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ Added changelog ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-nuj34q52p6ro7szapuz84i0v@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Different tree maintainers picked up independently generated
trivial compile fixes based on linux-next testing, resulting
in some cases where a file would have got more than one addition
of module.h once everything was all merged together.
Delete any duplicates so includecheck isn't complaining about
anything related to module.h/export.h changes.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
In the jump label enabled case, calling static_key_enabled()
results in a function call. The function returns the results of
a compare, so it really doesn't need the overhead of a full
function call. Let's make it 'static inline' for both the jump
label enabled and disabled cases.
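In outline, the change amounts to something like the following
user-space sketch. struct static_key_model and the _model suffix are
made up, and a plain int replaces whatever counter type the kernel
uses. Because the body is a single comparison, an inline definition in
the header lets the compiler fold it into the caller.

```c
#include <stdbool.h>

/* Made-up stand-in for the key; a plain int models the counter. */
struct static_key_model {
	int enabled;
};

/* Just a compare, so it can live in the header as static inline. */
static inline bool static_key_enabled_model(const struct static_key_model *key)
{
	return key->enabled > 0;
}
```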
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: a.p.zijlstra@chello.nl
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/201202281849.q1SIn1p2023270@int-mx10.intmail.prod.int.phx2.redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The major features of this series are:
- making RCU more aggressive about entering dyntick-idle mode in order to
  improve energy efficiency
- converting a few more call_rcu()s to kfree_rcu()s
- applying a number of rcutree fixes and cleanups to rcutiny
- removing CONFIG_SMP #ifdefs from treercu
- allowing RCU CPU stall times to be set via sysfs
- adding CPU-stall capability to rcutorture
- adding more RCU-abuse diagnostics
- updating documentation
- fixing yet more issues located by the still-ongoing top-to-bottom
  inspection of RCU, this time with a special focus on the
  CPU-hotplug code path.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If kzalloc() for TYPE_DATA failed on a given cpu, previous chunk
of TYPE_INST will be leaked. Fix it.
Thanks to Peter Zijlstra for suggesting this better solution. It
should work as long as the initial value of the region is all
0's and that's the case of static (per-cpu) memory allocation.
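The cleanup idea can be sketched in user space like this. It is an illustrative model only: calloc()/free() stand in for kzalloc()/kfree(), and NR_CPUS_MODEL, struct cpu_bufs and the fail_cpu failure-injection parameter are hypothetical. The point is that a single cleanup pass over zero-initialized storage is safe, because freeing a NULL pointer is a no-op.

```c
#include <stdlib.h>

#define NR_CPUS_MODEL 4   /* hypothetical CPU count for this sketch */

/* Static storage is zero-initialized, like static per-cpu memory. */
struct cpu_bufs {
    void *inst;   /* plays the role of the TYPE_INST chunk */
    void *data;   /* plays the role of the TYPE_DATA chunk */
};
static struct cpu_bufs bufs[NR_CPUS_MODEL];

/* Free everything; untouched slots are still NULL and are skipped by free(). */
static void free_all_bufs(void)
{
    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        free(bufs[cpu].inst);
        free(bufs[cpu].data);
        bufs[cpu].inst = NULL;
        bufs[cpu].data = NULL;
    }
}

/* fail_cpu >= 0 injects a failed "data" allocation on that CPU. */
static int alloc_all_bufs(size_t size, int fail_cpu)
{
    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        bufs[cpu].inst = calloc(1, size);
        if (!bufs[cpu].inst)
            goto fail;
        bufs[cpu].data = (cpu == fail_cpu) ? NULL : calloc(1, size);
        if (!bufs[cpu].data)
            goto fail;
    }
    return 0;

fail:
    /* Also releases the "inst" chunk already allocated for this CPU. */
    free_all_bufs();
    return -1;
}
```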
Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1330391978-28070-1-git-send-email-namhyung.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Handle pending irqs in irq_startup()
genirq: Unmask oneshot irqs when thread was not woken
On PowerPC, FUNCTION_TRACER selects FRAME_POINTER, even
though the architecture does not support it.
This causes the following warning:
warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP && FUNCTION_TRACER && KMEMCHECK) selects FRAME_POINTER which has unmet direct dependencies (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 || SUPERH || BLACKFIN || MN10300) || ARCH_WANT_FRAME_POINTERS)
So remove the warning by adding the extra condition
"if !PPC" to FUNCTION_TRACER for FRAME_POINTER selection
Link: http://lkml.kernel.org/r/1330330101-8618-1-git-send-email-gerlando.falauto@keymile.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, during CPU hotplug, the cpuset callbacks modify the cpusets
to reflect the state of the system, and this handling is asymmetric.
That is, upon CPU offline, that CPU is removed from all cpusets. However
when it comes back online, it is put back only to the root cpuset.
This gives rise to a significant problem during suspend/resume. During
suspend, we offline all non-boot cpus and during resume we online them back.
Which means, after a resume, all cpusets (except the root cpuset) will be
restricted to just one single CPU (the boot cpu). But the whole point of
suspend/resume is to restore the system to a state which is as close as
possible to how it was before suspend.
So to fix this, don't touch cpusets during suspend/resume. That is, modify
the cpuset-related CPU hotplug callback to just ignore CPU hotplug when it
is initiated as part of the suspend/resume sequence.
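A minimal sketch of the filtering, assuming that hotplug notifications issued on the suspend/resume path carry a "tasks frozen" flag. The constant names and values below are hypothetical stand-ins chosen for the sketch, and cpuset_hotplug_event() is not the kernel's callback.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for hotplug notifier actions. */
#define CPU_ONLINE_MODEL        0x0002UL
#define CPU_DEAD_MODEL          0x0007UL
#define CPU_TASKS_FROZEN_MODEL  0x0010UL  /* set during suspend/resume */

static int cpuset_updates;  /* counts how often cpusets were rebuilt */

/* Returns true if the cpusets were updated for this event. */
static bool cpuset_hotplug_event(unsigned long action)
{
    /* Suspend/resume-initiated hotplug: leave cpusets untouched. */
    if (action & CPU_TASKS_FROZEN_MODEL)
        return false;

    switch (action) {
    case CPU_ONLINE_MODEL:
    case CPU_DEAD_MODEL:
        cpuset_updates++;
        return true;
    default:
        return false;
    }
}
```

With this filter, a regular offline/online still updates the cpusets, while the offline/online pairs generated by suspend and resume leave them as they were.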
Reported-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/4F460D7B.1020703@linux.vnet.ibm.com
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next change.
epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in the case
of signalfd: the task which does EPOLL_CTL_ADD uses its ->sighand,
which is not connected to the file.
This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.
__cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
->signalfd_wqh is not empty, we add the new signalfd_cleanup()
helper.
ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This makes this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it is used with
epoll.
The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.
In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.
Note:
- we do not have wake_up_all_poll() but wake_up_poll()
is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.
- signalfd_cleanup() uses POLLHUP along with POLLFREE,
we need a couple of simple changes in eventpoll.c to
make sure it can't be "lost".
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes IRQ_DOMAIN usable on MIPS. It uses an ugly workaround
to preserve current behaviour so that MIPS has time to add irq_domain
registration to the irq controller drivers. The workaround will be
removed in Linux v3.6.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mips@linux-mips.org
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.
Typical usage scenarios:
        #include <linux/static_key.h>
        struct static_key key = STATIC_KEY_INIT_TRUE;
        if (static_key_false(&key))
                do unlikely code
        else
                do likely code
Or:
        if (static_key_true(&key))
                do likely code
        else
                do unlikely code
The static key is modified via:
        static_key_slow_inc(&key);
        ...
        static_key_slow_dec(&key);
The 'slow' prefix makes it abundantly clear that this is an
expensive operation.
I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.
On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.
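The fallback behaviour can be modelled in user space along these lines. This is a sketch, not the kernel implementation: struct static_key_model and the key_*() helpers are hypothetical names, the counter is a plain int rather than an atomic, and __builtin_expect() assumes GCC or Clang.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for a static key without jump labels. */
struct static_key_model {
    int enabled;   /* reference count; > 0 means the key is on */
};

#define likely_model(x)   __builtin_expect(!!(x), 1)
#define unlikely_model(x) __builtin_expect(!!(x), 0)

/* Branch hinted as usually not taken. */
static inline bool key_false(struct static_key_model *k)
{
    return unlikely_model(k->enabled > 0);
}

/* Branch hinted as usually taken. */
static inline bool key_true(struct static_key_model *k)
{
    return likely_model(k->enabled > 0);
}

/* In the kernel these are the expensive operations; here they just count. */
static inline void key_slow_inc(struct static_key_model *k) { k->enabled++; }
static inline void key_slow_dec(struct static_key_model *k) { k->enabled--; }
```

Both test helpers return the same value for a given key state; they differ only in the branch hint given to the compiler.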
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As the ring-buffer code is being used by other facilities in the
kernel, having the tracing_on file disable *all* buffers is not a desired
effect. It should only disable the ftrace buffers that are being used.
Move the code into the trace.c file and use the buffer disabling
for tracing_on() and tracing_off(). This way only the ftrace buffers
will be affected by them and other kernel utilities will not be
confused as to why their output suddenly stopped.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the initial SCHED_RR timeslice of init_task is HZ, which means
1s, and is not the same as the default SCHED_RR timeslice DEF_TIMESLICE.
Change that initial timeslice to DEF_TIMESLICE.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
[ s/DEF_TIMESLICE/RR_TIMESLICE/g ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F3C9995.3010800@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
select_idle_sibling() is called from select_task_rq_fair(), which
already has the RCU read lock held.
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120217030409.11748.12491.stgit@abhimanyu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 1ac9bc69 ("sched/tracing: Add a new tracepoint for sleeptime")
added a new sched:sched_stat_sleeptime tracepoint.
It's broken: the first sample we get on a task might be bad because
of a stale sleep_start value that wasn't reset at the last task switch
because the tracepoint was not active.
It also breaks the existing schedstat samples due to the side
effects of:
- se->statistics.sleep_start = 0;
...
- se->statistics.block_start = 0;
Nor do I see a means to fix it without adding overhead to the scheduler
fast path, which I'm not willing to do for the sake of redundant
instrumentation.
Most importantly, sleep time information can already be constructed
by tracing context switches and wakeups, and taking the timestamp
difference between the schedule-out, the wakeup and the schedule-in.
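The timestamp arithmetic can be sketched as follows. The struct and function names are hypothetical, and the three timestamps are assumed to have already been extracted from a context-switch and wakeup trace for one sleep episode of one task.

```c
#include <stdint.h>

/* Hypothetical record of one sleep episode, timestamps in nanoseconds. */
struct sleep_episode {
    uint64_t sched_out_ns;  /* task was switched out */
    uint64_t wakeup_ns;     /* task was woken up */
    uint64_t sched_in_ns;   /* task was switched back in */
};

/* Time spent asleep: from schedule-out to wakeup. */
static uint64_t sleep_time_ns(const struct sleep_episode *e)
{
    return e->wakeup_ns - e->sched_out_ns;
}

/* Time spent runnable but waiting for a CPU: from wakeup to schedule-in. */
static uint64_t wakeup_latency_ns(const struct sleep_episode *e)
{
    return e->sched_in_ns - e->wakeup_ns;
}
```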
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-pc4c9qhl8q6vg3bs4j6k0rbd@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add Peter Zijlstra's copyright to the uprobes code, whose
contributions to the uprobes code are not visible in the Git
history, because they were backmerged.
Also update existing copyright notices to the year 2012.
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-vjqxst502pc1efz7ah8cyht4@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A few cleanups suggested by Ingo Molnar:
- Rename struct uprobe_arch_info to struct arch_uprobe.
- Move insn from struct uprobe to struct arch_uprobe.
- Make arch-specific uprobe functions accept struct arch_uprobe
  instead of struct uprobe.
- Move struct uprobe to kernel/uprobes.c from include/linux/uprobes.h.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Stone <jistone@redhat.com>
Link: http://lkml.kernel.org/r/20120222091602.15880.40249.sendpatchset@srdronam.in.ibm.com
[ Made various small improvements ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
uprobe_opcode_sz refers to the smallest instruction size for
that architecture. UPROBES_BKPT_INSN_SIZE refers to the size of
the breakpoint instruction for that architecture.
For now we are assuming that both uprobe_opcode_sz and
UPROBES_BKPT_INSN_SIZE are the same for all archs and hence
removing uprobe_opcode_sz in favour of UPROBES_BKPT_INSN_SIZE.
However if we have to support architectures where the smallest
instruction size is different from the size of breakpoint
instruction, we may have to re-introduce uprobe_opcode_sz.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Stone <jistone@redhat.com>
Link: http://lkml.kernel.org/r/20120222091549.15880.67020.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
synchronize_sched_expedited() is spamming CONFIG_DEBUG_PREEMPT=y
users with an unintended warning from the cpu_is_offline() check: use
raw_smp_processor_id() instead of smp_processor_id() there.
Because the warning is under a get_online_cpus(), it is not possible
for any CPUs to go offline, though it is quite possible that the
task might migrate between the raw_smp_processor_id() and the check
of cpu_is_offline(). This is not a problem because the task cannot
migrate from an offline CPU to an online one or vice versa. The point
of the check is to verify that synchronize_sched_expedited() is not
called from an offline CPU, for example, from a CPU_DYING notifier, or,
more important, from an outgoing CPU making its way from its CPU_DYING
notifiers to the idle loop.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Walking through the tasklist in cgroup_enable_task_cg_list() inside
an RCU read side critical section is not enough because:
- RCU is not (yet) safe against while_each_thread()
- If we use only RCU, a forking task that has passed cgroup_post_fork()
  without seeing use_task_css_set_links == 1 is not guaranteed to have
  its child immediately visible in the tasklist if we walk through it
  remotely with RCU. In this case it will be missing in its css_set's
  task list.
Thus we need to traverse the list (unfortunately) under the
tasklist_lock. It makes us safe against while_each_thread() and also
makes sure we see all forked tasks that have been added to the tasklist.
As a secondary effect, reading and writing use_task_css_set_links are
now well ordered against tasklist traversing and modification. The new
layout is:
CPU 0                                      CPU 1
use_task_css_set_links = 1                 write_lock(tasklist_lock)
read_lock(tasklist_lock)                   add task to tasklist
do_each_thread() {                         write_unlock(tasklist_lock)
    add thread to css set links            if (use_task_css_set_links)
} while_each_thread()                          add thread to css set links
read_unlock(tasklist_lock)
If CPU 0 traverses the list after the task has been added to the tasklist
then it is correctly added to the css set links. OTOH if CPU 0 traverses
the tasklist before the new task had the opportunity to be added to the
tasklist because it was too early in the fork process, then CPU 1
catches up and adds the task to the css set links after it added the task
to the tasklist. The right value of use_task_css_set_links is guaranteed
to be visible from CPU 1 due to the LOCK/UNLOCK implicit barrier properties:
the read_unlock on CPU 0 ensures the write to use_task_css_set_links has
happened, and the write_lock on CPU 1 makes the read of
use_task_css_set_links that comes afterward return the correct value.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Remove the stale comment about RCU protection. Many callers
(all of them?) of cgroup_enable_task_cg_list() don't seem
to be in an RCU read side critical section. Besides, RCU is
not helpful to protect against while_each_thread().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit handles workloads that transition quickly between idle and
non-idle, and where the CPU's callbacks cannot be invoked, but where
RCU does not have anything immediate for the CPU to do. Without this
patch, the RCU_FAST_NO_HZ code can be invoked repeatedly on each entry
to idle. The commit sets the per-CPU rcu_dyntick_holdoff variable to
hold off further attempts for a tick.
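The holdoff can be modelled roughly like this. It is a single-CPU user-space sketch under stated assumptions: jiffies_model, dyntick_holdoff and prepare_for_idle() are hypothetical names, and the real code keeps this state per CPU and does considerably more work on each attempt.

```c
#include <stdbool.h>

static unsigned long jiffies_model = 1;  /* stand-in for the tick counter */
static unsigned long dyntick_holdoff;    /* tick during which attempts are held off */
static int flush_attempts;               /* how often the expensive path ran */

/*
 * Called on each entry to idle. Returns true if an attempt to get the
 * CPU into dyntick-idle mode was made, false if it was held off.
 */
static bool prepare_for_idle(bool callbacks_invocable)
{
    /* An attempt already failed during this tick: do not retry yet. */
    if (dyntick_holdoff == jiffies_model)
        return false;

    flush_attempts++;
    if (!callbacks_invocable) {
        /* Nothing can be done right now; hold off for the rest of the tick. */
        dyntick_holdoff = jiffies_model;
    }
    return true;
}
```

A workload that bounces in and out of idle many times within one tick then pays for at most one attempt per tick.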
Reported-by: "Abou Gazala, Neven M" <neven.m.abou.gazala@intel.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If a softirq is pending, the current CPU has RCU callbacks pending,
and RCU does not immediately need anything from this CPU, then the
current code resets the RCU_FAST_NO_HZ state machine. This means that
upon exit from the subsequent softirq handler, RCU_FAST_NO_HZ will
try really hard to force RCU into dyntick-idle mode. And if the same
conditions hold after a few tries (determined by RCU_IDLE_OPT_FLUSHES),
the same situation can repeat, possibly endlessly. This scenario is
not particularly good for battery lifetime.
This commit therefore suppresses the early exit from the RCU_FAST_NO_HZ
state machine in the case where there is a softirq pending. This change
forces the state machine to retain its memory, and to enter holdoff if
this condition persists.
Reported-by: "Abou Gazala, Neven M" <neven.m.abou.gazala@intel.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU, RCU-bh, and RCU-sched read-side critical sections are forbidden
in the inner idle loop, that is, between the rcu_idle_enter() and the
rcu_idle_exit() -- RCU will happily ignore any such read-side critical
sections. However, things like powertop need tracepoints in the inner
idle loop.
This commit therefore provides an RCU_NONIDLE() macro that can be used to
wrap code in the idle loop that requires RCU read-side critical sections.
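The shape of such a wrapper can be sketched in user space as follows. The _model names are hypothetical stand-ins: rcu_watching models whether RCU is paying attention to the CPU, and trace_idle_event() stands in for a tracepoint that needs RCU read-side protection.

```c
static int rcu_watching = 1;   /* 1: RCU pays attention to this CPU */
static int traced_events;      /* events recorded while RCU was watching */

static void rcu_idle_enter_model(void) { rcu_watching = 0; }
static void rcu_idle_exit_model(void)  { rcu_watching = 1; }

/* Temporarily leave the idle state around a statement that needs RCU. */
#define RCU_NONIDLE_MODEL(stmt) \
    do { \
        rcu_idle_exit_model(); \
        do { stmt; } while (0); \
        rcu_idle_enter_model(); \
    } while (0)

/* Stand-in for a tracepoint; returns whether RCU was watching. */
static int trace_idle_event(void)
{
    if (rcu_watching)
        traced_events++;
    return rcu_watching;
}
```

Inside the modelled idle loop, a bare trace_idle_event() is ignored, while RCU_NONIDLE_MODEL(trace_idle_event()) is seen by RCU and leaves the CPU marked idle again afterwards.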
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Although use of RCU in the idle loop is incorrect, quite a few instances of
just that have made their way into mainline, primarily in event tracing.
The problem with RCU read-side critical sections on CPUs that RCU believes
to be idle is that RCU is completely ignoring the CPU, along with any
attempts at RCU read-side critical sections.
The approaches of eliminating the offending uses and of pushing the
definition of idle down beyond the offending uses have both proved
impractical. The new approach is to encapsulate offending uses of RCU
with rcu_idle_exit() and rcu_idle_enter(), but this requires nesting
for code that is invoked both during idle and during normal execution.
Therefore, this commit modifies rcu_idle_enter() and rcu_idle_exit() to
permit nesting.
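Nesting can be modelled with a counter, as in the sketch below. The names are hypothetical and the kernel's actual bookkeeping is more involved; the sketch only shows why a helper that brackets its RCU use with exit/enter works both from idle and from normal execution.

```c
static long nonidle_nesting = 1;  /* > 0: CPU is non-idle from RCU's view */
static int rcu_sees_idle;         /* 1: RCU is ignoring this CPU */

/* One fewer reason to be non-idle; the outermost call makes the CPU idle. */
static void idle_enter(void)
{
    if (--nonidle_nesting == 0)
        rcu_sees_idle = 1;
}

/* One more reason to be non-idle; only the transition from 0 wakes RCU. */
static void idle_exit(void)
{
    if (nonidle_nesting++ == 0)
        rcu_sees_idle = 0;
}

/*
 * Code that may run either from the idle loop or from normal execution.
 * Returns 1 if RCU was watching while the "RCU use" took place.
 */
static int helper_using_rcu(void)
{
    int watching;

    idle_exit();
    watching = !rcu_sees_idle;   /* RCU read-side work would go here */
    idle_enter();
    return watching;
}
```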
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
There is now an unconditional check for rcu_head misalignment in
__call_rcu(), so remove the old conditional one in debug_rcu_head_queue().
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression e,e1;
constant c;
@@
*e = c
... when != e = e1
when != &e
when != true IS_ERR(e)
*PTR_ERR(e)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Fix a bonehead error introduced when adding event tracing to rcutorture.
Move the traces to follow the NULL-pointer checks.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The expedited RCU primitives can be quite useful, but they have some
high costs as well. This commit updates and creates docbook comments
calling out the costs, and updates the RCU documentation as well.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because newly offlined CPUs continue executing after completing the
CPU_DYING notifiers, they legitimately enter the scheduler and use
RCU while appearing to be offline. This calls for a more sophisticated
approach as follows:
1. RCU marks the CPU online during the CPU_UP_PREPARE phase.
2. RCU marks the CPU offline during the CPU_DEAD phase.
3. Diagnostics regarding use of read-side RCU by offline CPUs use
   RCU's accounting rather than the cpu_online_map. (Note that
   __call_rcu() still uses cpu_online_map to detect illegal
   invocations within CPU_DYING notifiers.)
4. Offline CPUs are prevented from hanging the system by
   force_quiescent_state(), which pays attention to cpu_online_map.
   Some additional work (in a later commit) will be needed to
   guarantee that force_quiescent_state() waits a full jiffy before
   assuming that a CPU is offline, for example, when called from
   idle entry. (This commit also makes the one-jiffy wait
   explicit, since the old-style implicit wait can now be defeated
   by RCU_FAST_NO_HZ and by rcutorture.)
This approach avoids the false positives encountered when attempting to
use more exact classification of CPU online/offline state.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is illegal to use RCU from a CPU that has reported idleness or
offlinedness to RCU. However, it can be quite difficult to determine
from a stack trace whether or not a given CPU is idle or offline.
Therefore, this commit adds idle/offline diagnostics to the lockdep-RCU
error message.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_prepare_for_idle() function is always called with interrupts
disabled, so there is no reason to disable interrupts again within
rcu_prepare_for_idle(). Therefore, this commit removes all of the
interrupt disabling, also removing a latent disabling-unbalance bug.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that TREE_RCU and TREE_PREEMPT_RCU no longer do anything different
for the single-CPU case, there is no need for multiple definitions of
synchronize_sched_expedited(). It is no longer in any sense a plug-in,
so move it from kernel/rcutree_plugin.h to kernel/rcutree.c.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although it is legal to use RCU during early boot, it is anything
but legal to use RCU at runtime from an offlined CPU. After all, RCU
explicitly ignores offlined CPUs. This commit therefore adds checks
for runtime use of RCU from offlined CPUs.
These checks are not perfect, in particular, they can be subverted
through use of things like rcu_dereference_raw(). Note that it is not
possible to put checks in rcu_read_lock() and friends due to the fact
that these primitives are used in code that might be used under either
RCU or lock-based protection, which means that checking rcu_read_lock()
gets you fat piles of false positives.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add module parameters to rcutorture that induce a CPU stall.
The stall_cpu parameter specifies how long to stall in seconds,
defaulting to zero, which indicates no stalling is to be undertaken.
The stall_cpu_holdoff parameter specifies how many seconds after
insmod (or boot, if rcutorture is built into the kernel) that this
stall is to start. The default value for stall_cpu_holdoff is ten
seconds.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When rcutorture is started automatically at boot time, it might well
also start CPU-hotplug operations at that time, which might not be
desirable. This commit therefore adds an rcutorture parameter that
allows CPU-hotplug operations to be held off for the specified number
of seconds after the start of boot.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There have been situations where RCU CPU stall warnings were caused by
issues in scheduling-clock timer initialization. To make it easier to
track these down, this commit causes the RCU CPU stall-warning messages
to print out the number of scheduling-clock interrupts taken in the
current grace period for each stalled CPU.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The default CONFIG_RCU_CPU_STALL_TIMEOUT value of 60 seconds has served
Linux users well for production use for quite some time. However, for
debugging, there will be more than three minutes between subsequent
stall-warning messages. This can be an annoyingly long wait if you
are trying to work out where the offending infinite loop is hiding.
Therefore, this commit provides a rcu_cpu_stall_timeout sysfs
parameter that may be adjusted at boot time and at runtime to speed
up debugging.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that both TINY_RCU and TINY_PREEMPT_RCU have been in place for awhile,
it is time to remove UP support from TREE_RCU, which is what this commit
does.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The inner idle loop is an extended quiescent state for all flavors
of RCU, but there have been recent bugs involving use of RCU read-side
primitives from within the idle loop. Therefore, this commit enlists
lockdep-RCU to detect attempts to enter the inner idle loop while in
an RCU read-side critical section, emitting a lockdep-RCU splat if so.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The recent updates to RCU_CPU_FAST_NO_HZ have an rcu_needs_cpu() that
does more than just check for callbacks, so get the name for
rcu_preempt_needs_cpu() consistent with that change, now calling it
rcu_preempt_cpu_has_callbacks().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This is a port of commit #82e78d80 from TREE_PREEMPT_RCU to
TINY_PREEMPT_RCU.
This commit uses the fact that current->rcu_boost_mutex is set
any time that the RCU_READ_UNLOCK_BOOSTED flag is set in the
current->rcu_read_unlock_special bitmask. This allows tests of
the bit to be changed to tests of the pointer, which in turn allows
the RCU_READ_UNLOCK_BOOSTED flag to be eliminated.
Please note that the check of current->rcu_read_unlock_special need not
change because any time that RCU_READ_UNLOCK_BOOSTED was set, so was
RCU_READ_UNLOCK_BLOCKED. Therefore, __rcu_read_unlock() can continue
testing current->rcu_read_unlock_special for non-zero, as before.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This is a port to TINY_RCU of Peter Zijlstra's commit #ec433f0c5
The rcu_read_unlock_special() function relies on in_irq() to exclude
scheduler activity from interrupt level. This fails because irq_exit()
can invoke the scheduler after clearing the preempt_count() bits that
in_irq() uses to determine that it is at interrupt level. This situation
can result in failures as follows:
$task                   IRQ                     SoftIRQ
rcu_read_lock()
/* do stuff */
<preempt> |= UNLOCK_BLOCKED
rcu_read_unlock()
  --t->rcu_read_lock_nesting
                        irq_enter();
                        /* do stuff, don't use RCU */
                        irq_exit();
                          sub_preempt_count(IRQ_EXIT_OFFSET);
                          invoke_softirq()
                                                ttwu();
                                                  spin_lock_irq(&pi->lock)
                                                  rcu_read_lock();
                                                  /* do stuff */
                                                  rcu_read_unlock();
                                                    rcu_read_unlock_special()
                                                      rcu_report_exp_rnp()
                                                        ttwu()
                                                          spin_lock_irq(&pi->lock) /* deadlock */
  rcu_read_unlock_special(t);
This can be triggered 'easily' because invoke_softirq() immediately does
a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff first,
but even without that the above happens.
Cure this by also excluding softirqs from the rcu_read_unlock_special()
handler and ensuring the force_irqthreads ksoftirqd/# wakeup is done
from full softirq context.
It is also necessary to delay the ->rcu_read_lock_nesting decrement until
after rcu_read_unlock_special(). This delay is handled by the commit
"Protect __rcu_read_unlock() against scheduler-using irq handlers".
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This is a port of commit #b0d3041 from TREE_RCU to TREE_PREEMPT_RCU.
Under some rare but real combinations of configuration parameters, RCU
callbacks are posted during early boot that use kernel facilities that are
not yet initialized. Therefore, when these callbacks are invoked, hard
hangs and crashes ensue. This commit therefore prevents RCU callbacks
from being invoked until after the scheduler is fully up and running,
as in after multiple tasks have been spawned.
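The gating can be sketched in user space as follows. This is a toy model with hypothetical names (queue_callback(), invoke_callbacks(), scheduler_fully_active): callbacks may be registered at any time, but nothing is invoked until the flag says the scheduler is up.

```c
#include <stddef.h>

#define MAX_PENDING 8   /* arbitrary capacity for this sketch */

typedef void (*callback_fn)(void);

static callback_fn pending[MAX_PENDING];
static size_t npending;
static int scheduler_fully_active;  /* set once the scheduler is up */
static int invoked;                 /* bumped by the sample callback */

/* Registration is always allowed, even during early boot. */
static int queue_callback(callback_fn fn)
{
    if (npending == MAX_PENDING)
        return -1;
    pending[npending++] = fn;
    return 0;
}

/* Invocation is deferred until the scheduler is running. */
static size_t invoke_callbacks(void)
{
    size_t n;

    if (!scheduler_fully_active)
        return 0;

    n = npending;
    for (size_t i = 0; i < n; i++)
        pending[i]();
    npending = 0;
    return n;
}

/* Sample callback used to observe invocation. */
static void count_callback(void)
{
    invoked++;
}
```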
It might well turn out that a better approach is to identify the specific
RCU callbacks that are causing this problem, but that discussion will
wait until such time as someone really needs an RCU callback to be invoked
(as opposed to merely registered) during early boot.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This is a port of commit #be0e1e21 to TINY_PREEMPT_RCU. This uses
noinline to prevent rcu_read_unlock_special() from being inlined into
__rcu_read_unlock().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit ports commit #10f39bb1b2 (rcu: protect __rcu_read_unlock()
against scheduler-using irq handlers) from TREE_PREEMPT_RCU to
TINY_PREEMPT_RCU. The following is a corresponding port of that
commit message.
The addition of RCU read-side critical sections within runqueue and
priority-inheritance critical sections introduced some deadlocks,
for example, involving interrupts from __rcu_read_unlock() where the
interrupt handlers call wake_up(). This situation can cause the
instance of __rcu_read_unlock() invoked from interrupt to do some
of the processing that would otherwise have been carried out by the
task-level instance of __rcu_read_unlock(). When the interrupt-level
instance of __rcu_read_unlock() is called with a scheduler lock held from
interrupt-entry/exit situations where in_irq() returns false, deadlock can
result. Of course, in a UP kernel, there are not really any deadlocks,
but the upper-level critical section can still be fatally confused
by the lower-level critical section changing things out from under it.
This commit resolves these deadlocks by using negative values of the
per-task ->rcu_read_lock_nesting counter to indicate that an instance of
__rcu_read_unlock() is in flight, which in turn prevents instances from
interrupt handlers from doing any special processing. Note that nested
rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will
never see ->rcu_read_lock_nesting go to zero, and will therefore never
invoke rcu_read_unlock_special(), thus preventing them from seeing the
RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special.
This patch also adds a check for ->rcu_read_unlock_special being negative
in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS
bit from being set should a scheduling-clock interrupt occur while
__rcu_read_unlock() is exiting from an outermost RCU read-side critical
section.
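As a rough userspace illustration of the negative-counter idea (not the
kernel code: struct task_model, model_read_unlock(), model_irq_handler()
and the other names below are invented for this sketch), the outermost
unlock parks the nesting counter at INT_MIN while it is in flight, so a
nested lock/unlock pair from an interrupt handler never sees the counter
reach zero and never runs the special path:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task RCU state; not the kernel layout. */
struct task_model {
	int nesting;         /* models ->rcu_read_lock_nesting */
	int special;         /* models the ->rcu_read_unlock_special bits */
	int special_calls;   /* how many times the special path ran */
	int calls_after_irq; /* special_calls as seen right after the "irq" */
};

typedef void (*irq_fn)(struct task_model *);

static void model_unlock_special(struct task_model *t)
{
	t->special = 0;
	t->special_calls++;
}

static void model_read_lock(struct task_model *t)
{
	t->nesting++;
}

static void model_read_unlock(struct task_model *t, irq_fn irq)
{
	if (t->nesting != 1) {
		/* Nested, or nested inside an in-flight unlock: count down only. */
		t->nesting--;
		return;
	}
	t->nesting = INT_MIN;	/* mark the outermost unlock as in flight */
	if (irq)
		irq(t);		/* simulate an interrupt arriving right here */
	if (t->special)
		model_unlock_special(t);
	t->nesting = 0;
}

/* Simulated interrupt handler that uses RCU while the unlock is in flight. */
static void model_irq_handler(struct task_model *t)
{
	model_read_lock(t);		/* INT_MIN -> INT_MIN + 1 */
	model_read_unlock(t, NULL);	/* back to INT_MIN, no special work */
	t->calls_after_irq = t->special_calls;
}
```

Calling model_read_lock() followed by model_read_unlock(&t, model_irq_handler)
on a task with special set leaves the special processing to the task-level
instance only.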
Of course, __rcu_read_unlock() can be preempted during the time that
->rcu_read_lock_nesting is negative. This could result in the setting
of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it,
and would also result in this task being queued on the corresponding
rcu_node structure's blkd_tasks list. Therefore, some later RCU read-side
critical section would enter rcu_read_unlock_special() to clean up --
which could result in deadlock (OK, OK, fatal confusion) if that RCU
read-side critical section happened to be in the scheduler where the
runqueue or priority-inheritance locks were held.
To prevent the possibility of fatal confusion that might result from
preemption during the time that ->rcu_read_lock_nesting is negative,
this commit also makes rcu_preempt_note_context_switch() check for
negative ->rcu_read_lock_nesting, thus refraining from queuing the task
(and from setting RCU_READ_UNLOCK_BLOCKED) if we are already exiting
from the outermost RCU read-side critical section (in other words,
we really are no longer actually in that RCU read-side critical
section). In addition, rcu_preempt_note_context_switch() invokes
rcu_read_unlock_special() to carry out the cleanup in this case, which
clears out the ->rcu_read_unlock_special bits and dequeues the task
(if necessary), in turn avoiding needless delay of the current RCU grace
period and needless RCU priority boosting.
It is still illegal to call rcu_read_unlock() while holding a scheduler
lock if the prior RCU read-side critical section has ever had both
preemption and irqs enabled. However, the common use case is legal,
namely where the entire RCU read-side critical section executes with
irqs disabled, for example, when the scheduler lock is held across the
entire lifetime of the RCU read-side critical section.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The grace-period initialization sequence in rcu_start_gp() has a special
case for systems where the rcu_node tree is a single rcu_node structure.
This made sense some years ago when systems were smaller and up to 64
CPUs could share a single rcu_node structure, but now that large systems
are common and a given leaf rcu_node structure can support only 16 CPUs
(due to lock contention on the rcu_node's ->lock field), this optimization
is almost never taken. And even the small mobile platforms that might
make use of it might rather have the kernel text reduction.
Therefore, this commit removes the check for single-rcu_node trees.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
RCU's current CPU-offline code path dumps all of the outgoing CPU's
callbacks onto the RCU_NEXT_TAIL portion of the surviving CPU's
callback list. This means that all the ready-to-invoke callbacks from
the outgoing CPU must wait for another full RCU grace period. This was
just fine when CPU-hotplug events were rare, but there is increasing
evidence that users are planning to make increasing use of CPU hotplug.
Therefore, this commit changes the callback-dumping procedure so that
callbacks that are ready to invoke are moved to the RCU_DONE_TAIL
portion of the surviving CPU's callback list. This avoids running
these callbacks through a second unnecessary grace period.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because quiescent states are now reported from offline CPUs in
CPU_DYING state, there is some possibility that such a CPU might
note the end of a grace period and attempt to start invoking
callbacks. This would be a very bad thing, and is supposed to
be prevented by the fact that the CPU_DYING CPU gets rid of all
its callbacks before reporting the quiescent state. However,
there is other CPU-offline code in the kernel, and it is quite
possible that someone will invoke RCU core processing from that
code. Therefore, this commit adds a warning for this case.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, a given CPU is permitted to remain in dyntick-idle mode
indefinitely if it has only lazy RCU callbacks queued. This is vulnerable
to corner cases in NUMA systems, so limit the time to six seconds by
default. (Currently controlled by a cpp macro.)
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Make rcutorture check for CPU-hotplug failures and complain if there
were any.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Move ->qsmaskinit and blkd_tasks[] manipulation to the CPU_DYING
notifier. This simplifies the code by eliminating a potential
deadlock and by reducing the responsibilities of force_quiescent_state().
Also rename functions to make their connection to the CPU-hotplug
stages explicit.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When CONFIG_RCU_FAST_NO_HZ is enabled, RCU will allow a given CPU to
enter dyntick-idle mode even if it still has RCU callbacks queued.
RCU avoids system hangs in this case by scheduling a timer for several
jiffies in the future. However, if all of the callbacks on that CPU
are from kfree_rcu(), there is no reason to wake the CPU up, as it is
not a problem to defer freeing of memory.
This commit therefore tracks the number of callbacks on a given CPU
that are from kfree_rcu(), and avoids scheduling the timer if all of
a given CPU's callbacks are from kfree_rcu().
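A minimal sketch of the bookkeeping, assuming one total counter and one
"lazy" counter per CPU (struct cpu_cb_model and both function names are
hypothetical, not the kernel's): the wakeup timer is only needed when at
least one queued callback is not lazy.

```c
#include <stdbool.h>

/* Hypothetical per-CPU callback counters for this sketch. */
struct cpu_cb_model {
	long qlen;	/* all queued callbacks */
	long qlen_lazy;	/* callbacks that only free memory (kfree_rcu()-like) */
};

static void cb_enqueue(struct cpu_cb_model *c, bool lazy)
{
	c->qlen++;
	if (lazy)
		c->qlen_lazy++;
}

/* True if some queued callback is non-lazy, so the CPU must be woken. */
static bool cb_needs_wakeup_timer(const struct cpu_cb_model *c)
{
	return c->qlen != c->qlen_lazy;
}
```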
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The push for energy efficiency will require that RCU tag rcu_head
structures to indicate whether or not their invocation is time critical.
This tagging is best carried out in the bottom bits of the ->next
pointers in the rcu_head structures. This tagging requires that the
rcu_head structures be properly aligned, so this commit adds the required
diagnostics.
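To illustrate why alignment matters, here is a userspace sketch of
low-bit pointer tagging. The two-bit mask, the "lazy" tag value and all
names are assumptions made for the example, not taken from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback header; mirrors the shape of an rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *);
};

#define CB_TAG_MASK ((uintptr_t)0x3)	/* assumed: two low bits reserved */
#define CB_TAG_LAZY ((uintptr_t)0x1)	/* assumed: "not time critical" */

/* The diagnostic: the low bits must be zero before they can carry a tag. */
static bool cb_head_is_aligned(const void *p)
{
	return ((uintptr_t)p & CB_TAG_MASK) == 0;
}

static struct cb_head *cb_tag_lazy(struct cb_head *next)
{
	return (struct cb_head *)((uintptr_t)next | CB_TAG_LAZY);
}

static bool cb_is_lazy(const struct cb_head *tagged)
{
	return ((uintptr_t)tagged & CB_TAG_LAZY) != 0;
}

/* Strip the tag bits to recover the real pointer. */
static struct cb_head *cb_untag(struct cb_head *tagged)
{
	return (struct cb_head *)((uintptr_t)tagged & ~CB_TAG_MASK);
}
```

If a structure were misaligned, its address would already have low bits
set and the tag could not be told apart from the address itself.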
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is illegal to have a grace period within a same-flavor RCU read-side
critical section, so this commit adds lockdep-RCU checks to splat when
such abuse is encountered. This commit does not detect more elaborate
RCU deadlock situations. These situations might be a job for lockdep
enhancements.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Adding support to filter function trace event via perf
interface. It is now possible to use filter interface
in the perf tool like:
perf record -e ftrace:function --filter="(ip == mm_*)" ls
The filter syntax is restricted to the 'ip' field only,
and the following operators are accepted: '==', '!=', '||', ending
up with the filter strings like:
ip == f1[, ]f2 ... || ip != f3[, ]f4 ...
with comma ',' or space ' ' as a function separator. If the
space ' ' is used as a separator, the right side of the
assignment needs to be enclosed in double quotes '"', e.g.:
perf record -e ftrace:function --filter '(ip == do_execve,sys_*,ext*)' ls
perf record -e ftrace:function --filter '(ip == "do_execve,sys_*,ext*")' ls
perf record -e ftrace:function --filter '(ip == "do_execve sys_* ext*")' ls
The '==' operator adds trace filter with same effect as would
be added via set_ftrace_filter file.
The '!=' operator adds trace filter with same effect as would
be added via set_ftrace_notrace file.
The right side of the '!=' and '==' operators is a list of functions
or regexps to be added to the filter, separated by commas or spaces.
The '||' operator is used for connecting multiple filter definitions
together. It is possible to have more than one '==' and '!='
operators within one filter string.
Link: http://lkml.kernel.org/r/1329317514-8131-8-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding FILTER_TRACE_FN event field type for function tracepoint
event, so it can be properly recognized within filtering code.
Currently all fields of ftrace subsystem events share the common
field type FILTER_OTHER. Since the function trace fields need
special care within the filtering code we need to recognize it
properly, hence adding the FILTER_TRACE_FN event type.
Adding filter parameter to the FTRACE_ENTRY macro, to specify the
filter field type for the event.
Link: http://lkml.kernel.org/r/1329317514-8131-7-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding perf registration support for the ftrace function event,
so it is now possible to register it via perf interface.
The perf_event struct statically contains ftrace_ops as a handle
for function tracer. The function tracer is registered/unregistered
in open/close actions.
To be efficient, we enable/disable ftrace_ops each time the traced
process is scheduled in/out (via TRACE_REG_PERF_(ADD|DEL) handlers).
This way tracing is enabled only when the process is running.
Intentionally using this way instead of the event's hw state
PERF_HES_STOPPED, which would not disable the ftrace_ops.
It is now possible to use function trace within perf commands
like:
perf record -e ftrace:function ls
perf stat -e ftrace:function ls
Allowed only for root.
Link: http://lkml.kernel.org/r/1329317514-8131-6-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding FTRACE_ENTRY_REG macro so particular ftrace entries
could specify a registration function and thus become accessible
via perf.
This will be used in an upcoming patch for function trace.
Link: http://lkml.kernel.org/r/1329317514-8131-5-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding TRACE_REG_PERF_ADD and TRACE_REG_PERF_DEL to handle
perf event schedule in/out actions.
The add action is invoked when the perf event is scheduled in,
while the del action is invoked when the event is scheduled out.
Link: http://lkml.kernel.org/r/1329317514-8131-4-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding TRACE_REG_PERF_OPEN and TRACE_REG_PERF_CLOSE to differentiate
register/unregister from open/close actions.
The register/unregister actions are invoked for the first/last
tracepoint user when opening/closing the event.
The open/close actions are invoked for each tracepoint user when
opening/closing the event.
Link: http://lkml.kernel.org/r/1329317514-8131-3-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding a way to temporarily enable/disable ftrace_ops. The change
follows the same way as 'global' ftrace_ops are done.
Introducing 2 global ftrace_ops - control_ops and ftrace_control_list
which take over all ftrace_ops registered with the FTRACE_OPS_FL_CONTROL
flag. In addition, a new per-cpu flag called 'disabled' is added to
ftrace_ops to provide the control information for each cpu.
When ftrace_ops with FTRACE_OPS_FL_CONTROL is registered, it is
set as disabled for all cpus.
The ftrace_control_list contains all the registered 'control' ftrace_ops.
The control_ops provides function which iterates ftrace_control_list
and does the check for 'disabled' flag on current cpu.
Adding 3 inline functions:
ftrace_function_local_disable/ftrace_function_local_enable
- enable/disable the ftrace_ops on current cpu
ftrace_function_local_disabled
- get the ftrace_ops::disabled value for current cpu
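The per-cpu control can be pictured with a small userspace model. This
is only a sketch: the fixed CPU count, struct ops_model and the
ops_model_*() helpers are invented here, and treating 'disabled' as a
counter is an assumption rather than something stated above.

```c
#define MODEL_NR_CPUS 4	/* arbitrary size for the sketch */

/* Hypothetical stand-in for a 'control' ftrace_ops. */
struct ops_model {
	int disabled[MODEL_NR_CPUS];	/* per-cpu 'disabled' value */
	unsigned long hits;		/* times the callback actually ran */
};

/* Registration leaves the ops disabled on every cpu. */
static void ops_model_register(struct ops_model *ops)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		ops->disabled[cpu] = 1;
	ops->hits = 0;
}

static void ops_model_local_enable(struct ops_model *ops, int cpu)
{
	ops->disabled[cpu]--;
}

static void ops_model_local_disable(struct ops_model *ops, int cpu)
{
	ops->disabled[cpu]++;
}

static int ops_model_local_disabled(const struct ops_model *ops, int cpu)
{
	return ops->disabled[cpu];
}

/* What the list walker does per entry: skip ops disabled on this cpu. */
static void ops_model_trace(struct ops_model *ops, int cpu)
{
	if (!ops_model_local_disabled(ops, cpu))
		ops->hits++;
}
```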
Link: http://lkml.kernel.org/r/1329317514-8131-2-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If more than one __print_*() function is used in a tracepoint
(__print_flags(), __print_symbols(), etc), then the temp seq buffer will
not be zero on entry. Using the temp seq buffer's length to know if
data has been printed or not in the current function is incorrect and
may produce incorrect results.
Currently, no in-tree tracepoint causes this bug, but new ones may
be created.
Cc: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If __print_flags() is used after another __print_*() function, the
temp seq_file buffer will not be empty on entry, and the delimiter will
be printed even though there's just one field. We get something like:
|S
instead of just:
S
This is because the length of the temp seq buffer is used to determine
if the delimiter is printed or not. But this algorithm fails when
the seq buffer is not empty on entry, and the delimiter will be printed
because it thinks that a previous field was already printed.
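One way to avoid the problem, sketched in userspace with a plain char
buffer instead of the kernel's seq buffer (all names and the sample flag
table are made up for the example), is to remember where this call
started writing and print the delimiter only if something was appended
after that point:

```c
#include <stdio.h>
#include <string.h>

struct flag_name_model {
	unsigned long mask;
	const char *name;
};

/* Sample table for the sketch; the names are made up. */
static const struct flag_name_model sample_flags[] = {
	{ 0x1, "S" },
	{ 0x2, "R" },
};

/* Append s at offset len; returns the new length, clamped to size - 1. */
static size_t buf_append(char *buf, size_t size, size_t len, const char *s)
{
	int n = snprintf(buf + len, size - len, "%s", s);

	if (n < 0)
		return len;
	if ((size_t)n >= size - len)
		return size - 1;
	return len + (size_t)n;
}

/* buf must already hold a NUL-terminated string shorter than size. */
static void print_flags_model(char *buf, size_t size, const char *delim,
			      unsigned long flags,
			      const struct flag_name_model *names, size_t count)
{
	size_t start = strlen(buf);	/* buffer may be non-empty on entry */
	size_t len = start;
	size_t i;

	for (i = 0; i < count; i++) {
		if ((flags & names[i].mask) != names[i].mask)
			continue;
		/* Compare against our own start, not against zero. */
		if (len > start)
			len = buf_append(buf, size, len, delim);
		len = buf_append(buf, size, len, names[i].name);
	}
}
```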
Link: http://lkml.kernel.org/r/1329650167-480655-1-git-send-email-avagin@openvz.org
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Assorted fixes, sat in -next for a week or so...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ocfs2: deal with wraparounds of i_nlink in ocfs2_rename()
vfs: fix compat_sys_stat() handling of overflows in st_nlink
quota: Fix deadlock with suspend and quotas
vfs: Provide function to get superblock and wait for it to thaw
vfs: fix panic in __d_lookup() with high dentry hashtable counts
autofs4 - fix lockdep splat in autofs
vfs: fix d_inode_lookup() dentry ref leak
Add helper functions to read and write struct timeval and struct
timespec from userspace. We already had helper functions for reading
and writing struct compat_timespec; add a set of functions to do the
same with struct timeval, and add a second suite of functions which
can be sensitive to COMPAT_USE_64BIT_TIME and access either 32- or
64-bit time structures.
This also exports these helper functions to modules.
Rename the existing inlines for converting between struct
compat_timeval and native struct timespec so we can have a saner
naming convention for the exported functions.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Replace the fd_sets in struct fdtable with an array of unsigned longs and then
use the standard non-atomic bit operations rather than the FD_* macros.
This:
(1) Removes the abuses of struct fd_set:
(a) Since we don't want to allocate a full fd_set the vast majority of the
time, we actually, in effect, just allocate a just-big-enough array of
unsigned longs and cast it to an fd_set type - so why bother with the
fd_set at all?
(b) Some places outside of the core fdtable handling code (such as
SELinux) want to look inside the array of unsigned longs hidden inside
the fd_set struct for more efficient iteration over the entire set.
(2) Eliminates the use of FD_*() macros in the kernel completely.
(3) Permits the __FD_*() macros to be deleted entirely where not exposed to
userspace.
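For illustration, plain non-atomic bit operations on an array of
unsigned longs look roughly like this in userspace C. The fd_bit_*()
names are invented for the sketch; the kernel has its own bit-operation
helpers.

```c
#include <limits.h>
#include <stdbool.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Set bit fd in the map; the caller guarantees the array is big enough. */
static void fd_bit_set(unsigned long *map, unsigned int fd)
{
	map[fd / MODEL_BITS_PER_LONG] |= 1UL << (fd % MODEL_BITS_PER_LONG);
}

static void fd_bit_clear(unsigned long *map, unsigned int fd)
{
	map[fd / MODEL_BITS_PER_LONG] &= ~(1UL << (fd % MODEL_BITS_PER_LONG));
}

static bool fd_bit_test(const unsigned long *map, unsigned int fd)
{
	return (map[fd / MODEL_BITS_PER_LONG] >>
		(fd % MODEL_BITS_PER_LONG)) & 1UL;
}
```

Because the storage is just an array of words, code that wants to scan
the whole set can walk it one unsigned long at a time.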
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.uk
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Print physical address info in a style consistent with the %pR style used
elsewhere in the kernel.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Since suspend_stats_update() is only called from pm_suspend(),
move its code directly into that function and remove the static
inline definition from include/linux/suspend.h. Clean up
pm_suspend() in the process.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
The enter_state() function in kernel/power/suspend.c should be
static and state_store() in kernel/power/suspend.c should call
pm_suspend() instead of it, so make that happen (which also reduces
code duplication related to suspend statistics).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
The kerneldoc comments in kernel/power/suspend.c are not formatted
in the same way and the quality of some of them is questionable.
Unify the formatting and improve the contents.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Make the uprobes code readable to me:
- improve the Kconfig text so that a mere mortal gets some idea
what CONFIG_UPROBES=y is really about
- do trivial renames to standardize around the uprobes_*() namespace
- clean up and simplify various code flow details
- separate basic blocks of functionality
- line break artifact and white space related removal
- use standard local variable definition blocks
- use vertical spacing to make things more readable
- remove unnecessary volatile
- restructure comment blocks to make them more uniform and
more readable in general
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Link: http://lkml.kernel.org/n/tip-ewbwhb8o6navvllsauu7k07p@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add uprobes support to the core kernel, with x86 support.
This commit adds the kernel facilities; the actual uprobes
user-space ABI and perf probe support come in later commits.
General design:
Uprobes are maintained in an rb-tree indexed by inode and offset
(the offset here is from the start of the mapping). For a unique
(inode, offset) tuple, there can be at most one uprobe in the
rb-tree.
Since the (inode, offset) tuple identifies a unique uprobe, more
than one user may be interested in the same uprobe. This provides
the ability to connect multiple 'consumers' to the same uprobe.
Each consumer defines a handler and a filter (optional). The
'handler' is run every time the uprobe is hit, if it matches the
'filter' criteria.
The first consumer of a uprobe causes the breakpoint to be
inserted at the specified address and subsequent consumers are
appended to this list. On subsequent probes, the consumer gets
appended to the existing list of consumers. The breakpoint is
removed when the last consumer unregisters. For all other
unregistrations, the consumer is removed from the list of
consumers.
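The consumer bookkeeping described above can be sketched as a small
userspace model. The structures and the uprobe_model_*() functions are
hypothetical; the real code also deals with the rb-tree, locking and the
actual breakpoint write.

```c
#include <stdbool.h>
#include <stddef.h>

struct consumer_model {
	struct consumer_model *next;
};

struct uprobe_model {
	struct consumer_model *consumers;
	bool breakpoint_inserted;
};

/* First consumer triggers breakpoint insertion; later ones just join. */
static void uprobe_model_register(struct uprobe_model *u,
				  struct consumer_model *c)
{
	bool first = (u->consumers == NULL);

	c->next = u->consumers;
	u->consumers = c;
	if (first)
		u->breakpoint_inserted = true;
}

/* Removing the last consumer removes the breakpoint. */
static bool uprobe_model_unregister(struct uprobe_model *u,
				    struct consumer_model *c)
{
	struct consumer_model **pp;

	for (pp = &u->consumers; *pp; pp = &(*pp)->next) {
		if (*pp != c)
			continue;
		*pp = c->next;
		c->next = NULL;
		if (!u->consumers)
			u->breakpoint_inserted = false;
		return true;
	}
	return false;	/* consumer was not registered on this uprobe */
}
```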
Given an inode, we get a list of the mms that have mapped the
inode. Do the actual registration if mm maps the page where a
probe needs to be inserted/removed.
We use a temporary list to walk through the vmas that map the
inode.
- The number of maps that map the inode is not known before we
walk the rmap and keeps changing.
- extending vm_area_struct wasn't recommended, it's a
size-critical data structure.
- There can be more than one map of the inode in the same mm.
We add callbacks to the mmap methods to keep an eye on text vmas
that are of interest to uprobes. When a vma of interest is mapped,
we insert the breakpoint at the right address.
Uprobe works by replacing the instruction at the address defined
by (inode, offset) with the arch specific breakpoint
instruction. We save a copy of the original instruction at the
uprobed address.
This is needed for:
a. executing the instruction out-of-line (xol).
b. instruction analysis for any subsequent fixups.
c. restoring the instruction back when the uprobe is unregistered.
We insert or delete a breakpoint instruction, and this
breakpoint instruction is assumed to be the smallest instruction
available on the platform. For fixed size instruction platforms
this is trivially true, for variable size instruction platforms
the breakpoint instruction is typically the smallest (often a
single byte).
Writing the instruction is done by COWing the page and changing
the instruction during the copy, this even though most platforms
allow atomic writes of the breakpoint instruction. This also
mirrors the behaviour of a ptrace() memory write to a PRIVATE
file map.
The core worker is derived from KSM's replace_page() logic.
In essence, similar to KSM:
a. allocate a new page and copy over contents of the page that
has the uprobed vaddr
b. modify the copy and insert the breakpoint at the required
address
c. switch the original page with the copy containing the
breakpoint
d. flush page tables.
replace_page() is being replicated here because of some minor
changes in the type of pages and also because Hugh Dickins had
plans to improve replace_page() for KSM specific work.
Instruction analysis on x86 is based on instruction decoder and
determines if an instruction can be probed and determines the
necessary fixups after singlestep. Instruction analysis is done
at probe insertion time so that we avoid having to repeat the
same analysis every time a probe is hit.
A lot of code here is due to the improvement/suggestions/inputs
from Peter Zijlstra.
Changelog:
(v10):
- Add code to clear REX.B prefix as suggested by Denys Vlasenko
and Masami Hiramatsu.
(v9):
- Use insn_offset_modrm as suggested by Masami Hiramatsu.
(v7):
Handle comments from Peter Zijlstra:
- Dont take reference to inode. (expect inode to uprobe_register to be sane).
- Use PTR_ERR to set the return value.
- No need to take reference to inode.
- use PTR_ERR to return error value.
- register and uprobe_unregister share code.
(v5):
- Modified del_consumer as per comments from Peter.
- Drop reference to inode before dropping reference to uprobe.
- Use i_size_read(inode) instead of inode->i_size.
- Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called.
- Includes errno.h as recommended by Stephen Rothwell to fix a build issue
on sparc defconfig
- Remove restrictions while unregistering.
- Earlier code leaked inode references under some conditions while
registering/unregistering.
- Continue the vma-rmap walk even if the intermediate vma doesnt
meet the requirements.
- Validate the vma found by find_vma before inserting/removing the
breakpoint
- Call del_consumer under mutex_lock.
- Use hash locks.
- Handle mremap.
- Introduce find_least_offset_node() instead of close match logic in
find_uprobe
- Uprobes no more depends on MM_OWNER; No reference to task_structs
while inserting/removing a probe.
- Uses read_mapping_page instead of grab_cache_page so that the pages
have valid content.
- pass NULL to get_user_pages for the task parameter.
- call SetPageUptodate on the new page allocated in write_opcode.
- fix leaking a reference to the new page under certain conditions.
- Include Instruction Decoder if Uprobes gets defined.
- Remove const attributes for instruction prefix arrays.
- Uses mm_context to know if the application is 32 bit.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Also-written-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com
[ Made various small edits to the commit log ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make irq_domain_ops pointer a constant to make it safer for multiple
instances to share the same ops pointer and change the irq_domain code
so that it does not modify the ops.
v4: Fix mismatched type reference in powerpc code
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
Rather than having each interrupt controller driver creating its own barely
unique .xlate function for irq_domain, create a library of translators which
any driver can use directly.
v5: - Remove irq_domain_xlate_pci(). It was incorrect.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
irq_domain_add_simple() was a stop-gap measure until full irq_domain
support was complete. This patch removes the irq_domain_add_simple()
interface.
This patch also drops the explicit irq_domain initialization performed
by the mach-versatile code because the versatile interrupt controller
already has irq_domain support built into it. This was a bug that was
hanging around quietly for a while, but with the full irq_domain which
actually verifies that irq_domain ranges are available it would cause
the registration to fail and the system wouldn't boot.
v4: Fixed number of irqs in mx5 gpio code
v2: Updated to pass in host_data pointer on irq_domain allocation.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Cc: Russell King <linux@arm.linux.org.uk>
Tested-by: Olof Johansson <olof@lixom.net>
This patch removes the simplistic implementation of irq_domains and enables
the powerpc infrastructure for all irq_domain users. The powerpc
infrastructure includes support for complex mappings between Linux and
hardware irq numbers, and can manage allocation of irq_descs.
This patch also converts the few users of irq_domain_add()/irq_domain_del()
to call irq_domain_add_legacy() instead.
v3: Fix bug that set up too many irqs in translation range.
v2: Fix removal of irq_alloc_descs() call in gic driver
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
Add support for a legacy mapping where irq = (hwirq - first_hwirq + first_irq)
so that a controller driver can allocate a fixed range of irq_descs and use
a simple calculation to translate back and forth between linux and hw irq
numbers. This is needed to use an irq_domain with many of the ARM interrupt
controller drivers that manage their own irq_desc allocations. Ultimately
the goal is to migrate those drivers to use the linear revmap, but doing it
this way allows each driver to be converted separately which makes the
migration path easier.
This patch generalizes the IRQ_DOMAIN_MAP_LEGACY method to use
(first_irq-first_hwirq) as the offset between hwirq and linux irq number,
and adds checks to make sure that the hwirq number does not exceed range
assigned to the controller.
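The translation and range check amount to the following arithmetic,
shown as a standalone sketch. struct legacy_domain_model and
legacy_hwirq_to_irq() are illustrative names, not the kernel's
interface; 0 is returned for "no irq".

```c
/* Hypothetical description of a legacy range for this sketch. */
struct legacy_domain_model {
	unsigned int first_irq;		/* first Linux irq number */
	unsigned long first_hwirq;	/* first hardware irq number */
	unsigned int size;		/* number of irqs in the range */
};

/* irq = hwirq - first_hwirq + first_irq, or 0 if hwirq is out of range. */
static unsigned int legacy_hwirq_to_irq(const struct legacy_domain_model *d,
					unsigned long hwirq)
{
	if (hwirq < d->first_hwirq || hwirq >= d->first_hwirq + d->size)
		return 0;
	return (unsigned int)(hwirq - d->first_hwirq) + d->first_irq;
}
```

For example, with first_irq = 64, first_hwirq = 8 and size = 4, hwirq 11
maps to irq 67, while hwirq 7 and hwirq 12 are rejected.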
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
Each revmap type has different arguments for setting up the revmap.
This patch splits up the generator functions so that each revmap type
can do its own setup and the user doesn't need to keep track of how
each revmap type handles the arguments.
This patch also adds a host_data argument to the generators. There are
cases where the host_data pointer will be needed before the function returns.
i.e. the legacy map calls the .map callback for each irq before returning.
v2: - Add void *host_data argument to irq_domain_add_*() functions
- fixed failure to compile
- Moved IRQ_DOMAIN_MAP_* defines into irqdomain.c
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
No functional changes. Replaces non-exported references to 'host' with domain.
Does not change any symbol names referenced by other .c files.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
Zero always means no irq when using irq domains. Get rid of the NO_IRQ
references.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
This patch only moves the code. It doesn't make any changes, and the
code is still only compiled for powerpc. Follow-on patches will generalize
the code for other architectures.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
idle_cpu() is called on irq entry to guess if we need to call
tick_check_idle(). This way we can catch up with jiffies if the tick
was stopped, stop accounting idle time during the interrupt and
maintain the sched clock if it is unstable.
But if we are going to exit the idle loop to schedule a new task (ie:
if we have a task in the runqueue or a remotely enqueued ttwu to
perform), the idle_cpu() check will return 0 such that we miss the
call to tick_check_idle() for all interrupts happening before we
schedule the new task.
As a result these interrupts and the softirqs coming along may deal
with stale jiffies values, bad sched clock values, and won't subtract
their time from the idle time accounting.
Fix this by using is_idle_task() instead, which strictly checks that
we are running the idle task, without caring about the fact we are
going to schedule a task soon.
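The difference between the two checks can be modelled in a few lines.
This is a simplified userspace sketch: struct rq_model and both
functions are invented here, and the real idle_cpu() is only
approximated.

```c
#include <stdbool.h>

/* Hypothetical runqueue snapshot for this sketch. */
struct rq_model {
	bool curr_is_idle_task;	/* the idle task is currently running */
	int nr_running;		/* runnable tasks waiting on the runqueue */
	int pending_wakeups;	/* remotely queued wakeups not yet processed */
};

/* Approximates idle_cpu(): "idle" only if nothing is about to run. */
static bool model_idle_cpu(const struct rq_model *rq)
{
	return rq->curr_is_idle_task && rq->nr_running == 0 &&
	       rq->pending_wakeups == 0;
}

/* Approximates is_idle_task(current): only looks at what runs right now. */
static bool model_is_idle_task(const struct rq_model *rq)
{
	return rq->curr_is_idle_task;
}
```

With the idle task still running but a wakeup pending, model_idle_cpu()
reports false while model_is_idle_task() reports true, which is the
window in which the irq-entry check used to be skipped.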
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Link: http://lkml.kernel.org/r/1327427984-23282-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ts->inidle is set by tick_nohz_idle_enter() and unset by
tick_nohz_idle_exit(). However these two calls are assumed
to be always paired. This means that by the time we call
tick_nohz_idle_exit(), ts->inidle is supposed to be always
set to 1.
Remove the checks for ts->inidle in tick_nohz_idle_exit().
This simplifies the code a bit and improves its debuggability
(ie: ensure the call is paired with a tick_nohz_idle_enter()
call).
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Link: http://lkml.kernel.org/r/1327427984-23282-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no reason to call update_ts_time_stat from tick_nohz_start_idle
anymore (after e0e37c20 sched: Eliminate the ts->idle_lastupdate field)
when we updated idle_lastupdate unconditionally.
We haven't set idle_active yet and do not provide last_update_time so
the whole call ends up being just 2 wasted branches.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/1322755222-6951-1-git-send-email-mhocko@suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Platforms with an Always Running APIC Timer don't use the broadcast timer,
but the kernel is leaving the broadcast timer (HPET in this case)
in oneshot mode.
On these platforms, before the switch to oneshot mode, the broadcast device is
actually in shutdown mode. The code checks for an empty tick_broadcast_mask and
avoids going into periodic mode.
During the switch to oneshot mode, add the same tick_broadcast_mask checks in
tick_broadcast_switch_to_oneshot() and avoid putting the broadcast device into
oneshot mode.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: venki@google.com
Link: http://lkml.kernel.org/r/1320452301.15071.16.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No need to assign ret in each case and break. Simply return the result
of the handler function directly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@linux.intel.com>
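The "return the handler result directly" cleanup above follows a common
pattern. A minimal before/after sketch, where the opcodes and handlers
are hypothetical and not the futex ones:

```c
#include <errno.h>

/* Hypothetical opcodes and handlers, for illustration only. */
enum demo_op { DEMO_OP_WAIT, DEMO_OP_WAKE };

static int demo_wait(int val) { return val > 0 ? 0 : -EAGAIN; }
static int demo_wake(int val) { return val; }

/* Before: every case assigns ret and breaks. */
static int dispatch_old(enum demo_op op, int val)
{
	int ret;

	switch (op) {
	case DEMO_OP_WAIT:
		ret = demo_wait(val);
		break;
	case DEMO_OP_WAKE:
		ret = demo_wake(val);
		break;
	default:
		ret = -ENOSYS;
	}
	return ret;
}

/* After: each case returns the handler result directly. */
static int dispatch_new(enum demo_op op, int val)
{
	switch (op) {
	case DEMO_OP_WAIT:
		return demo_wait(val);
	case DEMO_OP_WAKE:
		return demo_wake(val);
	}
	return -ENOSYS;
}
```

Both versions return the same values; the second simply drops the
temporary and the break statements.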
Some of the newer futex PI opcodes do not check the cmpxchg enabled
variable and call unconditionally into the handling functions. Cover
all PI opcodes in a separate check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@linux.intel.com>
An interrupt might be pending when irq_startup() is called, but the
startup code does not invoke the resend logic. In some cases this
prevents the device from issuing another interrupt which renders the
device non functional.
Call the resend function in irq_startup() to keep things going.
Reported-and-tested-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When the primary handler of an interrupt which is marked IRQ_ONESHOT
returns IRQ_HANDLED or IRQ_NONE, then the interrupt thread is not
woken and the unmask logic of the interrupt line is never
invoked. This keeps the interrupt masked forever.
This was not noticed as most IRQ_ONESHOT users wake the thread
unconditionally (usually because they cannot access the underlying
device from hard interrupt context). Though this behaviour was nowhere
documented and not necessarily intentional. Some drivers can avoid the
thread wakeup in certain cases and run into the situation where the
interrupt line is kept masked.
Handle it gracefully.
Reported-and-tested-by: Lothar Wassmann <lw@karo-electronics.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Part of the series to unify the irq remapping mechanisms in the
kernel. A follow up patch will copy the powerpc implementation into
kernel/irq/irqdomain.c, which will be a lot easier if the structures
are identical.
Where they differ, I've chosen to use the powerpc names since there is
a lot more code using those names.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
irq_domain printk's too much. Drop some output.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Olof Johansson <olof@lixom.net>
When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
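To see why a signed loop bound breaks at 2147483648 entries in the
dcache_init() report above, the following userspace sketch models a loop
of the form "for (loop = 0; loop < (1 << shift); loop++)". It assumes a
32-bit int; the helper names are hypothetical, and the signed case is
modelled with 64-bit arithmetic so that the demo itself stays free of
undefined behaviour.

```c
#include <limits.h>

/*
 * Number of buckets the loop would initialize when the bound is held in
 * a signed int. 2^31 does not fit in a 32-bit int; on typical
 * two's-complement targets the bound ends up negative, so the loop body
 * never runs. That outcome is modelled here as 0 iterations.
 */
static unsigned long long buckets_signed(unsigned int shift)
{
	long long want = 1LL << shift;

	if (want > INT_MAX)
		return 0;
	return (unsigned long long)want;
}

/* With an unsigned bound, 1U << 31 is well defined and the loop runs. */
static unsigned long long buckets_unsigned(unsigned int shift)
{
	unsigned int bound = 1U << shift;	/* valid for shift <= 31 */

	return bound;
}
```

For shift = 31 the signed model initializes nothing while the unsigned
one covers all 2147483648 buckets, matching the uninitialized
dentry_hashtable described in the report.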
Allow the bint param to accept NULL values, the same as the bool param does.
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Collapse security_vm_enough_memory() variants into a single function.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
The advantage of kcalloc is that it prevents integer overflows which could
result from the multiplication of number of elements and size, and it is also
a bit nicer to read.
The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107
Link: http://lkml.kernel.org/r/1322600880.1534.347.camel@localhost.localdomain
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
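The overflow that the kcalloc conversion above guards against can be
sketched in userspace. checked_calloc() below is a hypothetical stand-in
that mirrors the idea (refuse the allocation when n * size would wrap);
it is not the kernel implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for an overflow-checked, zeroing array allocation. */
static void *checked_calloc(size_t n, size_t size)
{
	void *p;

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would wrap around */

	p = malloc(n * size);
	if (p)
		memset(p, 0, n * size);
	return p;
}
```

An open-coded malloc(n * size) would silently allocate a too-small
buffer in the wrapping case; the checked variant fails instead.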
Add a printk.console trace point to record any printk
messages into the trace, regardless of the current
console loglevel. This can help correlate (existing)
printk debugging with other tracing.
Link: http://lkml.kernel.org/r/1322161388.5366.54.camel@jlt3.sipsolutions.net
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The Finish label in suspend_freeze_processes() is in fact unnecessary
and makes the function look more complicated than it really is, so
remove that label (along with a few empty lines).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Use the observation that it is more efficient to check the wakeup
variable once before the loop reporting tasks that were not
frozen in try_to_freeze_tasks() than to do that in every step of that
loop.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The PM QoS feature originally didn't depend on CONFIG_PM, which was
mistakenly changed by commit e8db0be124
PM QoS: Move and rename the implementation files
Later, commit d020283dc6
PM / QoS: CPU C-state breakage with PM Qos change
partially fixed that by introducing a static inline definition of
pm_qos_request(), but that still didn't allow user space to use
the PM QoS interface if CONFIG_PM was unset (which had been possible
before). For this reason, remove the dependency of PM QoS on
CONFIG_PM to make it work (as intended) with CONFIG_PM unset.
[rjw: Replaced the original changelog with a new one.]
Signed-off-by: Jean Pihet <j-pihet@ti.com>
Reported-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The sched_switch function tracer has been merged into the wakeup/wakeup_rt
tracers, so update the 'mini-HOWTO' for ftrace (kernel function tracer).
To trace sched_switch now, use the "sched:sched_switch" event option
instead (e.g.: trace-cmd list -e | grep sched).
This patch is based on Linux-3.3.rc2-SMP-PREEMPT
Link: http://lkml.kernel.org/r/1328695537-15081-1-git-send-email-geunsik.lim@gmail.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Says Jens:
"Time to push off some of the pending items. I really wanted to wait
until we had the regression nailed, but alas it's not quite there yet.
But I'm very confident that it's "just" a missing expire on exit, so
fix from Tejun should be fairly trivial. I'm headed out for a week on
the slopes.
- Killing the barrier part of mtip32xx. It doesn't really support
barriers, and it doesn't need them (writes are fully ordered).
- A few fixes from Dan Carpenter, preventing overflows of integer
multiplication.
- A fixup for loop, fixing a previous commit that didn't quite solve
the partial read problem from Dave Young.
- A bio integer overflow fix from Kent Overstreet.
- Improvement/fix of the door "keep locked" part of the cdrom shared
code from Paolo Bonzini.
- A few cfq fixes from Shaohua Li.
- A fix for bsg sysfs warning when removing a file it did not create
from Stanislaw Gruszka.
- Two fixes for floppy from Vivek, preventing a crash.
- A few block core fixes from Tejun. One killing the over-optimized
ioc exit path, cleaning that up nicely. Two others fixing an oops
on elevator switch, due to calling into the scheduler merge check
code without holding the queue lock."
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix lockdep warning on io_context release put_io_context()
relay: prevent integer overflow in relay_open()
loop: zero fill bio instead of return -EIO for partial read
bio: don't overflow in bio_get_nr_vecs()
floppy: Fix a crash during rmmod
floppy: Cleanup disk->queue before caling put_disk() if add_disk() was never called
cdrom: move shared static to cdrom_device_info
bsg: fix sysfs link remove warning
block: don't call elevator callbacks for plug merges
block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions
mtip32xx: removed the irrelevant argument of mtip_hw_submit_io() and the unused member of struct driver_data
block: strip out locking optimization in put_io_context()
cdrom: use copy_to_user() without the underscores
block: fix ioc locking warning
block: fix NULL icq_cache reference
block,cfq: change code order
Reflect the change in the soft and hard lockup thresholds and
their relation to the frequency of the hrtimer and NMI events in
the code comments. While at it, remove references to files that
do not exist anymore.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1328827342-6253-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix double start/stop in x86_pmu_start()
perf evsel: Fix an issue where perf report fails to show the proper percentage
perf tools: Fix prefix matching for kernel maps
perf tools: Fix perf stack to non executable on x86_64
perf: Remove deprecated WARN_ON_ONCE()
"subbuf_size" and "n_subbufs" come from the user and they need to be
capped to prevent an integer overflow.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code related to 'freezer_test_done' is needlessly convoluted.
Refactor the code and simplify the implementation.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
In the hibernation call path, the kernel threads are frozen inside
hibernation_snapshot(). If we happen to encounter an error further down
the road or if we are exiting early due to a successful freezer test,
then thaw kernel threads before returning to the caller.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The code
	if (error) {
		suspend_stats.fail++;
		dpm_save_failed_errno(error);
	} else
		suspend_stats.success++;
appears in both kernel/power/main.c and kernel/power/suspend.c.
This patch just creates a new function to avoid duplicated code.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
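A userspace sketch of the kind of helper the suspend statistics cleanup
above describes. The struct, the helper name demo_update_suspend_stats()
and the errno-saving stub are all hypothetical; the message does not
name the new function.

```c
/* Hypothetical stand-in for the kernel's suspend statistics. */
struct demo_suspend_stats {
	int success;
	int fail;
	int last_errno;
};

static struct demo_suspend_stats demo_stats;

/* Stand-in for dpm_save_failed_errno(): remember the last error. */
static void demo_save_failed_errno(int error)
{
	demo_stats.last_errno = error;
}

/* One helper replaces the if/else block duplicated in both callers. */
static void demo_update_suspend_stats(int error)
{
	if (error) {
		demo_stats.fail++;
		demo_save_failed_errno(error);
	} else {
		demo_stats.success++;
	}
}
```

Each caller then reduces to a single demo_update_suspend_stats(error)
call after the suspend attempt.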
The current LSM interface to cred_free is not sufficient for allowing
an LSM to track the life and death of a task. This patch adds the
task_free hook so that an LSM can clean up resources on task death.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <jmorris@namei.org>
The following patch fixes a bug introduced by the following
commit:
e050e3f0a7 ("perf: Fix broken interrupt rate throttling")
The patch caused the following warning to pop up depending on
the sampling frequency adjustments:
------------[ cut here ]------------
WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()
It was caused by the following call sequence:
    perf_adjust_freq_unthr_context.part() {
        stop()
        if (delta > 0) {
            perf_adjust_period() {
                if (period > 8*...) {
                    stop()
                    ...
                    start()
                }
            }
        }
        start()
    }
Which caused a double start and a double stop, thus triggering
the assert in x86_pmu_start().
The patch fixes the problem by avoiding the double calls. We
pass a new argument to perf_adjust_period() to indicate whether
or not the event is already stopped. We can't just remove the
start/stop from that function because it's called from
__perf_event_overflow where the event needs to be reloaded via a
stop/start back-to-back call.
The patch reintroduces the assertion in x86_pmu_start() which
was removed by commit:
84f2b9b ("perf: Remove deprecated WARN_ON_ONCE()")
In this second version, we've added calls to disable/enable PMU
during unthrottling or frequency adjustment based on bug report
of spurious NMI interrupts from Eric Dumazet.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: markus@trippelsdorf.de
Cc: paulus@samba.org
Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
[ Minor edits to the changelog and to the code ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
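The double start/stop fix above can be illustrated with a small
userspace model. Everything here is hypothetical (struct demo_event, the
demo_* functions and the already_stopped parameter name); it only models
the idea of telling the period-adjust helper whether the caller has
already stopped the event.

```c
/* Hypothetical event: tracks state and counts redundant start/stop calls. */
struct demo_event {
	int running;
	int double_calls;
	long period;
};

static void demo_stop(struct demo_event *e)
{
	if (!e->running)
		e->double_calls++;	/* stop on an already stopped event */
	e->running = 0;
}

static void demo_start(struct demo_event *e)
{
	if (e->running)
		e->double_calls++;	/* start on an already running event */
	e->running = 1;
}

/* already_stopped: nonzero when the caller has stopped the event itself. */
static void demo_adjust_period(struct demo_event *e, long period,
			       int already_stopped)
{
	if (!already_stopped)
		demo_stop(e);
	e->period = period;
	if (!already_stopped)
		demo_start(e);
}

/* Unthrottle path: the event is stopped around the whole adjustment. */
static void demo_unthrottle(struct demo_event *e, long period)
{
	demo_stop(e);
	demo_adjust_period(e, period, 1);
	demo_start(e);
}

/* Overflow path: the helper must do the stop/start reload itself. */
static void demo_overflow(struct demo_event *e, long period)
{
	demo_adjust_period(e, period, 0);
}
```

Calling demo_adjust_period() with already_stopped = 0 from inside the
unthrottle path reproduces the double stop and double start from the
call sequence in the message.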
put_io_context() performed a complex trylock dance to avoid
deferring ioc release to workqueue. It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.
While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization. Strip it out. If there turn out to be workloads which
are affected by this change, a simpler optimization from the discussion
thread can be applied later.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
So that we can get the perf bench exec stack fixes and then apply the
remaining fix for the files added after what is in perf/urgent.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Merge tag 'pm-fixes-for-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Power management fixes for 3.3-rc3
Three power management regression fixes, one for a recent regression introduced
by the freezer changes during the 3.3 merge window and two for regressions
in cpuidle (resulting from PM QoS changes) and in the hibernate user space
interface, both introduced during the 3.2 development cycle.
They include:
* Two hibernate (s2disk) regression fixes from Srivatsa S. Bhat (for
regressions introduced during the 3.3 merge window and during the 3.2
development cycle).
* A cpuidle fix from Venki Pallipadi for a regression resulting from PM QoS
changes during the 3.2 development cycle causing cpuidle to work incorrectly
for CONFIG_PM unset.
* tag 'pm-fixes-for-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / QoS: CPU C-state breakage with PM Qos change
PM / Freezer: Thaw only kernel threads if freezing of kernel threads fails
PM / Hibernate: Thaw kernel threads in SNAPSHOT_CREATE_IMAGE ioctl path
If freezing of kernel threads fails, we are expected to automatically
thaw tasks in the error recovery path. However, at times, we encounter
situations in which we would like the automatic error recovery path
to thaw only the kernel threads, because we want to be able to do
some more cleanup before we thaw userspace. Something like:
	error = freeze_kernel_threads();
	if (error) {
		/* Do some cleanup */
		/* Only then thaw userspace tasks*/
		thaw_processes();
	}
An example of such a situation is where we freeze/thaw filesystems
during suspend/hibernation. There, if freezing of kernel threads
fails, we would like to thaw the frozen filesystems before thawing
the userspace tasks.
So, modify freeze_kernel_threads() to thaw only kernel threads in
case of freezing failure. And change suspend_freeze_processes()
accordingly. (At the same time, let us also get rid of the rather
cryptic usage of the conditional operator (?:) in that function.)
[rjw: In fact, this patch fixes a regression introduced during the
3.3 merge window, because without it thaw_processes() may be called
before swsusp_free() in some situations and that may lead to massive
memory allocation failures.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
In function pre_handler_kretprobe(), the allocated kretprobe_instance
object will get leaked if the entry_handler callback returns non-zero.
This may cause all the preallocated kretprobe_instance objects to be exhausted.
This issue can be reproduced by changing
samples/kprobes/kretprobe_example.c to probe "mutex_unlock". And the fix
is straightforward: just put the allocated kretprobe_instance object back
onto the free_instances list.
[akpm@linux-foundation.org: use raw_spin_lock/unlock]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
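The kretprobe_instance leak fixed above comes down to returning an
object to its free list on the early-exit path. A simplified userspace
model, with hypothetical names throughout and locking omitted:

```c
#include <stddef.h>

#define DEMO_MAXACTIVE 2

struct demo_instance {
	struct demo_instance *next;
};

/* Hypothetical probe with a small preallocated pool of instances. */
struct demo_probe {
	struct demo_instance pool[DEMO_MAXACTIVE];
	struct demo_instance *free_list;
	int (*entry_handler)(void);
	int nmissed;
};

static void demo_probe_init(struct demo_probe *p, int (*handler)(void))
{
	size_t i;

	p->free_list = NULL;
	for (i = 0; i < DEMO_MAXACTIVE; i++) {
		p->pool[i].next = p->free_list;
		p->free_list = &p->pool[i];
	}
	p->entry_handler = handler;
	p->nmissed = 0;
}

static size_t demo_free_count(const struct demo_probe *p)
{
	const struct demo_instance *ri;
	size_t n = 0;

	for (ri = p->free_list; ri; ri = ri->next)
		n++;
	return n;
}

/* Returns the instance that stays in use, or NULL if none does. */
static struct demo_instance *demo_pre_handler(struct demo_probe *p)
{
	struct demo_instance *ri = p->free_list;

	if (!ri) {
		p->nmissed++;	/* pool exhausted */
		return NULL;
	}
	p->free_list = ri->next;

	if (p->entry_handler && p->entry_handler()) {
		/* Handler declined: put the instance back on the free list. */
		ri->next = p->free_list;
		p->free_list = ri;
		return NULL;
	}
	return ri;
}

static int demo_reject(void) { return 1; }
```

Without the put-back branch, DEMO_MAXACTIVE rejected calls would empty
the pool and every later hit would only increment nmissed.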
EXPORT_SYMBOL(adjust_resource) should be right after adjust_resource().
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
__raise_softirq_irqoff() contains a tracepoint. As tracepoints in headers
can cause issues and, not to mention, bloat the kernel when they are
in a static inline, it is best to move the function that contains the
tracepoint out of the header and into softirq.c.
Link: http://lkml.kernel.org/r/20120118120711.GB14863@elte.hu
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the ftrace_set_filter and ftrace_set_notrace functions
do not return any return code, so there's no way for an ftrace_ops
user to tell whether the filter was correctly applied.
The set_ftrace_filter interface returns an error in case the filter
did not match:
  # echo krava > set_ftrace_filter
  bash: echo: write error: Invalid argument
Change both the ftrace_set_filter and ftrace_set_notrace functions
to return zero if the filter was applied correctly or -E* values
in case of error.
Link: http://lkml.kernel.org/r/1325495060-6402-2-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As noted by Arve and others, since wall time can jump backwards, it is
difficult to use for input because one cannot determine if one event
occurred before another or for how long a key was pressed.
However, the timestamp field is part of the kernel ABI, and cannot be
changed without possibly breaking existing users.
This patch adds a new IOCTL that allows a clockid to be set in the
evdev_client struct that will specify which time base to use for event
timestamps (ie: CLOCK_MONOTONIC instead of CLOCK_REALTIME).
For now we only support CLOCK_MONOTONIC and CLOCK_REALTIME, but
in the future we could support other clockids if appropriate.
The default remains CLOCK_REALTIME, so we don't change the ABI.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Daniel Kurtz <djkurtz@google.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
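From userspace, selecting the timestamp clock described in the evdev
change above might look like the sketch below. It assumes the ioctl is
the one exposed as EVIOCSCLOCKID in <linux/input.h>; evdev_set_clock()
is a hypothetical wrapper name, and the caller is expected to pass a
clock id such as CLOCK_MONOTONIC from <time.h>.

```c
#include <linux/input.h>
#include <sys/ioctl.h>

/*
 * Ask an open evdev file descriptor to timestamp its events with the
 * given clock id. Returns 0 on success, -1 with errno set on failure.
 */
static int evdev_set_clock(int fd, int clkid)
{
	return ioctl(fd, EVIOCSCLOCKID, &clkid);
}
```

A client that never issues the ioctl keeps receiving CLOCK_REALTIME
timestamps, which is why the existing ABI is unaffected.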
This fixes the race in process_vm_core found by Oleg (see
http://article.gmane.org/gmane.linux.kernel/1235667/
for details).
This has been updated since I last sent it as the creation of the new
mm_access() function did almost exactly the same thing as parts of the
previous version of this patch did.
In order to use mm_access() even when /proc isn't enabled, we move it to
kernel/fork.c where other related process mm access functions already
are.
Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bugs, x86: Fix printk levels for panic, softlockups and stack dumps
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf top: Fix number of samples displayed
perf tools: Fix strlen() bug in perf_event__synthesize_event_type()
perf tools: Fix broken build by defining _GNU_SOURCE in Makefile
x86/dumpstack: Remove unneeded check in dump_trace()
perf: Fix broken interrupt rate throttling
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW
sched: Fix ancient race in do_exit()
sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug
sched/s390: Fix compile error in sched/core.c
sched: Fix rq->nr_uninterruptible update race
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/reboot: Remove VersaLogic Menlow reboot quirk
x86/reboot: Skip DMI checks if reboot set by user
x86: Properly parenthesize cmpxchg() macro arguments
The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.
Now only ->populate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().
So we reduce a few lines of code, though the shrinking of object size
is minimal.
16 files changed, 113 insertions(+), 162 deletions(-)
   text    data     bss      dec    hex filename
5486240  656987 7039960 13183187 c928d3 vmlinux.o.orig
5486170  656987 7039960 13183117 c9288d vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In the SNAPSHOT_CREATE_IMAGE ioctl, if the call to hibernation_snapshot()
fails, the frozen tasks are not thawed.
And in the case of success, if we happen to exit due to a successful freezer
test, all tasks (including those of userspace) are thawed, whereas actually
we should have thawed only the kernel threads at that point. Fix both these
issues.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
This also fixes the following compilation warning on !SMP:
CC kernel/sched/fair.o
kernel/sched/fair.c:218:36: warning: 'max_load_balance_interval' defined but not used [-Wunused-variable]
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F2754A0.9090306@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>