Commit Graph

15574 Commits

Author SHA1 Message Date
Joonsoo Kim
f1c4069e1d mm, vmalloc: export vmap_area_list, instead of vmlist
Although our intention is to unexport internal structure entirely, but
there is one exception for kexec.  kexec dumps address of vmlist and
makedumpfile uses this information.

We are about to remove vmlist, then another way to retrieve information
of vmalloc layer is needed for makedumpfile.  For this purpose, we
export vmap_area_list, instead of vmlist.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:34 -07:00
Josh Triplett
146732ce10 fs: don't compile in drop_caches.c when CONFIG_SYSCTL=n
drop_caches.c provides code only invokable via sysctl, so don't compile it
in when CONFIG_SYSCTL=n.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Michal Hocko
6d2488f64a cgroup: remove css_get_next
Now that we have generic and well ordered cgroup tree walkers there is
no need to keep css_get_next in the place.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Jiang Liu
e07cee23e6 mm,kexec: use common help functions to free reserved pages
Use common help functions to free reserved pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:31 -07:00
Chen Gang
12b2f117f3 kernel/audit_tree.c: tree will leak memory when failure occurs in audit_trim_trees()
audit_trim_trees() calls get_tree().  If a failure occurs we must call
put_tree().

[akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement]
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Chen Gang
373e0f3408 kernel/auditfilter.c: tree and watch will memory leak when failure occurs
In audit_data_to_entry() when a failure occurs we must check and free
the tree and watch to avoid a memory leak.

  test:
    plan:
      test command:
        "auditctl -a exit,always -w /etc -F auid=-1"
        (on fedora17, need modify auditctl to let "-w /etc" has effect)
      running:
        under fedora17 x86_64, 2 CPUs 3.20GHz, 2.5GB RAM.
        let 15 auditctl processes continue running at the same time.
      monitor command:
        watch -d -n 1 "cat /proc/meminfo | awk '{print \$2}' \
          | head -n 4 | xargs \
          | awk '{print \"used \",\$1 - \$2 - \$3 - \$4}'"

    result:
      for original version:
        will use up all memory, within 3 hours.
        kill all auditctl, the memory still does not free.
      for new version (apply this patch):
        after 14 hours later, not find issues.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
dde5b7d6e7 audit: remove unnecessary #if CONFIG_AUDIT
The files which include kernel/audit.h are complied only when
CONFIG_AUDIT is set.

Just like audit_pid, there is no need to surround audit_ever_enabled
with CONFIG_AUDIT.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
374c586d95 audit: remove duplicate export of audit_enabled
audit_enabled has already been exported in include/linux/audit.h.  and
kernel/audit.h includes include/linux/audit.h, no need to export
aduit_enabled again in kernel/audit.h

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
13f51e1c3f audit: don't check if kauditd is valid every time
We only need to check if kauditd is valid after we start it, if kauditd
is invalid, we will set kauditd_task to NULL.  So next time, we will
start kauditd again.

It means if kauditd_task is not NULL,it must be valid.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Rakib Mullick
3f68613f39 kernel/auditsc.c: use kzalloc instead of kmalloc+memset
In audit_alloc_context() use kzalloc instead of kmalloc+memset.  Also
rename audit_zero_context() to audit_set_context(), to represent it's
inner workings properly.

[akpm@linux-foundation.org: remove audit_set_context() altogether - fold it into its caller]
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Oleg Nesterov
b5c5442bb6 kthread: kill task_get_live_kthread()
task_get_live_kthread() looks confusing and unneeded.  It does
get_task_struct() but only kthread_stop() needs this, it can be called
even if the calller doesn't have a reference when we know that this
kthread can't exit until we do kthread_stop().

kthread_park() and kthread_unpark() do not need get_task_struct(), the
callers already have the reference.  And it can not help if we can race
with the exiting kthread anyway, kthread_park() can hang forever in this
case.

Change kthread_park() and kthread_unpark() to use to_live_kthread(),
change kthread_stop() to do get_task_struct() by hand and remove
task_get_live_kthread().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:25 -07:00
Oleg Nesterov
4ecdafc808 kthread: introduce to_live_kthread()
"k->vfork_done != NULL" with a barrier() after to_kthread(k) in
task_get_live_kthread(k) looks unclear, and sub-optimal because we load
->vfork_done twice.

All we need is to ensure that we do not return to_kthread(NULL).  Add a
new trivial helper which loads/checks ->vfork_done once, this also looks
more understandable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:25 -07:00
Linus Torvalds
9e8529afc4 Merge tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
 "Along with the usual minor fixes and clean ups there are a few major
  changes with this pull request.

   1) Multiple buffers for the ftrace facility

  This feature has been requested by many people over the last few
  years.  I even heard that Google was about to implement it themselves.
  I finally had time and cleaned up the code such that you can now
  create multiple instances of the ftrace buffer and have different
  events go to different buffers.  This way, a low frequency event will
  not be lost in the noise of a high frequency event.

  Note, currently only events can go to different buffers, the tracers
  (ie function, function_graph and the latency tracers) still can only
  be written to the main buffer.

   2) The function tracer triggers have now been extended.

  The function tracer had two triggers.  One to enable tracing when a
  function is hit, and one to disable tracing.  Now you can record a
  stack trace on a single (or many) function(s), take a snapshot of the
  buffer (copy it to the snapshot buffer), and you can enable or disable
  an event to be traced when a function is hit.

   3) A perf clock has been added.

  A "perf" clock can be chosen to be used when tracing.  This will cause
  ftrace to use the same clock as perf uses, and hopefully this will
  make it easier to interleave the perf and ftrace data for analysis."

* tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (82 commits)
  tracepoints: Prevent null probe from being added
  tracing: Compare to 1 instead of zero for is_signed_type()
  tracing: Remove obsolete macro guard _TRACE_PROFILE_INIT
  ftrace: Get rid of ftrace_profile_bits
  tracing: Check return value of tracing_init_dentry()
  tracing: Get rid of unneeded key calculation in ftrace_hash_move()
  tracing: Reset ftrace_graph_filter_enabled if count is zero
  tracing: Fix off-by-one on allocating stat->pages
  kernel: tracing: Use strlcpy instead of strncpy
  tracing: Update debugfs README file
  tracing: Fix ftrace_dump()
  tracing: Rename trace_event_mutex to trace_event_sem
  tracing: Fix comment about prefix in arch_syscall_match_sym_name()
  tracing: Convert trace_destroy_fields() to static
  tracing: Move find_event_field() into trace_events.c
  tracing: Use TRACE_MAX_PRINT instead of constant
  tracing: Use pr_warn_once instead of open coded implementation
  ring-buffer: Add ring buffer startup selftest
  tracing: Bring Documentation/trace/ftrace.txt up to date
  tracing: Add "perf" trace_clock
  ...

Conflicts:
	kernel/trace/ftrace.c
	kernel/trace/trace.c
2013-04-29 13:55:38 -07:00
Linus Torvalds
916bb6d76d Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking changes from Ingo Molnar:
 "The most noticeable change are mutex speedups from Waiman Long, for
  higher loads.  These scalability changes should be most noticeable on
  larger server systems.

  There are also cleanups, fixes and debuggability improvements."

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Consolidate bug messages into a single print_lockdep_off() function
  lockdep: Print out additional debugging advice when we hit lockdep BUGs
  mutex: Back out architecture specific check for negative mutex count
  mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
  mutex: Make more scalable by doing less atomic operations
  mutex: Move mutex spinning code from sched/core.c back to mutex.c
  locking/rtmutex/tester: Set correct permissions on sysfs files
  lockdep: Remove unnecessary 'hlock_next' variable
2013-04-29 08:21:37 -07:00
Li Zefan
2a0010af17 cpuset: fix compile warning when CONFIG_SMP=n
Reported by Fengguang's kbuild test robot:

kernel/cpuset.c:787: warning: 'generate_sched_domains' defined but not used

Introduced by commit e0e80a02e5
("cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()),
which removed generate_sched_domains() from cpuset_hotplug_workfn().

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 19:55:04 -07:00
Linus Torvalds
e09d13c4c8 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
 "This fix adds missing RCU read protection"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  events: Protect access via task_subsys_state_check()
2013-04-27 10:08:09 -07:00
Li Zefan
5b16c2a493 cpuset: fix cpu hotplug vs rebuild_sched_domains() race
rebuild_sched_domains() might pass doms with offlined cpu to
partition_sched_domains(), which results in an oops:

general protection fault: 0000 [#1] SMP
...
RIP: 0010:[<ffffffff81077a1e>]  [<ffffffff81077a1e>] get_group+0x6e/0x90
...
Call Trace:
 [<ffffffff8107f07c>] build_sched_domains+0x70c/0xcb0
 [<ffffffff8107f2a7>] ? build_sched_domains+0x937/0xcb0
 [<ffffffff81173f64>] ? kfree+0xe4/0x1b0
 [<ffffffff8107f6e0>] ? partition_sched_domains+0xc0/0x470
 [<ffffffff8107f905>] partition_sched_domains+0x2e5/0x470
 [<ffffffff8107f6e0>] ? partition_sched_domains+0xc0/0x470
 [<ffffffff810c9007>] ? generate_sched_domains+0xc7/0x530
 [<ffffffff810c94a8>] rebuild_sched_domains_locked+0x38/0x70
 [<ffffffff810cb4a4>] cpuset_write_resmask+0x1a4/0x500
 [<ffffffff810c8700>] ? cpuset_mount+0xe0/0xe0
 [<ffffffff810c7f50>] ? cpuset_read_u64+0x100/0x100
 [<ffffffff810be890>] ? cgroup_iter_next+0x90/0x90
 [<ffffffff810cb300>] ? cpuset_css_offline+0x70/0x70
 [<ffffffff810c1a73>] cgroup_file_write+0x133/0x2e0
 [<ffffffff8118995b>] vfs_write+0xcb/0x130
 [<ffffffff8118a174>] sys_write+0x64/0xa0

Reported-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 06:52:43 -07:00
Li Zhong
e0e80a02e5 cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
In cpuset_hotplug_workfn(), partition_sched_domains() is called without
hotplug lock held, which is actually needed (stated in the function
header of partition_sched_domains()).

This patch tries to use rebuild_sched_domains() to solve the above
issue, and makes the code looks a little simpler.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 06:52:43 -07:00
Li Zefan
7ef70e4873 cgroup: restore the call to eventfd->poll()
I mistakenly removed the call to eventfd->poll() while I was actually
intending to remove the return value...

Calling evenfd->poll() will hook cgroup_event_wake() to the poll
waitqueue, which will be called to unregister eventfd when rmdir a
cgroup or close eventfd.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-26 11:58:03 -07:00
Li Zefan
cc20e01cd6 cgroup: fix use-after-free when umounting cgroupfs
Try:
  # mount -t cgroup xxx /cgroup
  # mkdir /cgroup/sub && rmdir /cgroup/sub && umount /cgroup

And you might see this:

ida_remove called for id=1 which is not allocated.

It's because cgroup_kill_sb() is called to destroy root->cgroup_ida
and free cgrp->root before ida_simple_removed() is called. What's
worse is we're accessing cgrp->root while it has been freed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-26 11:58:02 -07:00
Vincent Guittot
25f55d9d01 sched: Fix init NOHZ_IDLE flag
On my SMP platform which is made of 5 cores in 2 clusters, I
have the nr_busy_cpu field of sched_group_power struct that is
not null when the platform is fully idle - which makes the
scheduler unhappy.

The root cause is:

During the boot sequence, some CPUs reach the idle loop and set
their NOHZ_IDLE flag while waiting for others CPUs to boot. But
the nr_busy_cpus field is initialized later with the assumption
that all CPUs are in the busy state whereas some CPUs have
already set their NOHZ_IDLE flag.

More generally, the NOHZ_IDLE flag must be initialized when new
sched_domains are created in order to ensure that NOHZ_IDLE and
nr_busy_cpus are aligned.

This condition can be ensured by adding a synchronize_rcu()
between the destruction of old sched_domains and the creation of
new ones so the NOHZ_IDLE flag will not be updated with old
sched_domain once it has been initialized. But this solution
introduces a additionnal latency in the rebuild sequence that is
called during cpu hotplug.

As suggested by Frederic Weisbecker, another solution is to have
the same rcu lifecycle for both NOHZ_IDLE and sched_domain
struct. A new nohz_idle field is added to sched_domain so both
status and sched_domain will share the same RCU lifecycle and
will be always synchronized. In addition, there is no more need
to protect nohz_idle against concurrent access as it is only
modified by 2 exclusive functions called by local cpu.

This solution has been prefered to the creation of a new struct
with an extra pointer indirection for sched_domain.

The synchronization is done at the cost of :

 - An additional indirection and a rcu_dereference for accessing nohz_idle.
 - We use only the nohz_idle field of the top sched_domain.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: fweisbec@gmail.com
Cc: pjt@google.com
Cc: rostedt@goodmis.org
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org
[ Fixed !NO_HZ build bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 12:13:44 +02:00
Dave Jones
2c52283662 lockdep: Consolidate bug messages into a single print_lockdep_off() function
Also add some missing printk levels.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20130425174002.GA26769@redhat.com
[ Tweaked the messages a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 08:37:22 +02:00
Dave Jones
199e371f59 lockdep: Print out additional debugging advice when we hit lockdep BUGs
We occasionally get reports of these BUGs being hit, and the
stack trace doesn't necessarily always tell us what we need to
know about why we are hitting those limits.

If users start attaching /proc/lock_stats to reports we may have
more of a clue what's going on.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20130423163403.GA12839@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 08:36:33 +02:00
Thomas Gleixner
6f7a05d701 clockevents: Set dummy handler on CPU_DEAD shutdown
Vitaliy reported that a per cpu HPET timer interrupt crashes the
system during hibernation. What happens is that the per cpu HPET timer
gets shut down when the nonboot cpus are stopped. When the nonboot
cpus are onlined again the HPET code sets up the MSI interrupt which
fires before the clock event device is registered. The event handler
is still set to hrtimer_interrupt, which then crashes the machine due
to highres mode not being active.

See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700333

There is no real good way to avoid that in the HPET code. The HPET
code alrady has a mechanism to detect spurious interrupts when event
handler == NULL for a similar reason.

We can handle that in the clockevent/tick layer and replace the
previous functional handler with a dummy handler like we do in
tick_setup_new_device().

The original clockevents code did this in clockevents_exchange_device(),
but that got removed by commit 7c1e76897 (clockevents: prevent
clockevent event_handler ending up handler_noop) which forgot to fix
it up in tick_shutdown(). Same issue with the broadcast device.

Reported-by: Vitaliy Fillipov <vitalif@yourcmc.ru>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Cc: 700333@bugs.debian.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-25 13:57:04 +02:00
Thomas Gleixner
6402c7dc2a Merge branch 'linus' into timers/core
Reason: Get upstream fixes before adding conflicting code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-24 20:33:54 +02:00
Joonsoo Kim
e02e60c109 sched: Prevent to re-select dst-cpu in load_balance()
Commit 88b8dac0 makes load_balance() consider other cpus in its
group. But, in that, there is no code for preventing to
re-select dst-cpu. So, same dst-cpu can be selected over and
over.

This patch add functionality to load_balance() in order to
exclude cpu which is selected once. We prevent to re-select
dst_cpu via env's cpus, so now, env's cpus is a candidate not
only for src_cpus, but also dst_cpus.

With this patch, we can remove lb_iterations and
max_lb_iterations, because we decide whether we can go ahead or
not via env's cpus.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:46 +02:00
Joonsoo Kim
e6252c3ef4 sched: Rename load_balance_tmpmask to load_balance_mask
This name doesn't represent specific meaning.
So rename it to imply it's purpose.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:45 +02:00
Joonsoo Kim
d31980846f sched: Move up affinity check to mitigate useless redoing overhead
Currently, LBF_ALL_PINNED is cleared after affinity check is
passed. So, if task migration is skipped by small load value or
small imbalance value in move_tasks(), we don't clear
LBF_ALL_PINNED. At last, we trigger 'redo' in load_balance().

Imbalance value is often so small that any tasks cannot be moved
to other cpus and, of course, this situation may be continued
after we change the target cpu. So this patch move up affinity
check code and clear LBF_ALL_PINNED before evaluating load value
in order to mitigate useless redoing overhead.

In addition, re-order some comments correctly.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:44 +02:00
Joonsoo Kim
cfc0311804 sched: Don't consider other cpus in our group in case of NEWLY_IDLE
Commit 88b8dac0 makes load_balance() consider other cpus in its
group, regardless of idle type. When we do NEWLY_IDLE balancing,
we should not consider it, because a motivation of NEWLY_IDLE
balancing is to turn this cpu to non idle state if needed. This
is not the case of other cpus. So, change code not to consider
other cpus for NEWLY_IDLE balancing.

With this patch, assign 'if (pulled_task) this_rq->idle_stamp =
0' in idle_balance() is corrected, because NEWLY_IDLE balancing
doesn't consider other cpus. Assigning to 'this_rq->idle_stamp'
is now valid.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:44 +02:00
Joonsoo Kim
de5eb2dd7f sched: Explicitly cpu_idle_type checking in rebalance_domains()
After commit 88b8dac0, dst-cpu can be changed in load_balance(),
then we can't know cpu_idle_type of dst-cpu when load_balance()
return positive. So, add explicit cpu_idle_type checking.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:43 +02:00
Joonsoo Kim
f1cd085810 sched: Change position of resched_cpu() in load_balance()
cur_ld_moved is reset if env.flags hit LBF_NEED_BREAK.
So, there is possibility that we miss doing resched_cpu().
Correct it as changing position of resched_cpu()
before checking LBF_NEED_BREAK.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:43 +02:00
Thomas Gleixner
77c675ba18 timekeeping: Update tk->cycle_last in resume
commit 7ec98e15aa (timekeeping: Delay update of clock->cycle_last)
forgot to update tk->cycle_last in the resume path. This results in a
stale value versus clock->cycle_last and prevents resume in the worst
case.

Reported-by: Jiri Slaby <jslaby@suse.cz>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Linux-pm mailing list <linux-pm@lists.linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304211648150.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:17:51 +02:00
Rusty Russell
f83b293366 kernel/hz.bc: ignore.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-22 07:09:06 -07:00
Linus Torvalds
3125929454 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix offcore_rsp valid mask for SNB/IVB
  perf: Treat attr.config as u64 in perf_swevent_init()
2013-04-21 10:25:42 -07:00
Vincent Guittot
642dbc39ab sched: Fix wrong rq's runnable_avg update with rt tasks
The current update of the rq's load can be erroneous when RT
tasks are involved.

The update of the load of a rq that becomes idle, is done only
if the avg_idle is less than sysctl_sched_migration_cost. If RT
tasks and short idle duration alternate, the runnable_avg will
not be updated correctly and the time will be accounted as idle
time when a CFS task wakes up.

A new idle_enter function is called when the next task is the
idle function so the elapsed time will be accounted as run time
in the load of the rq, whatever the average idle time is. The
function update_rq_runnable_avg is removed from idle_balance.

When a RT task is scheduled on an idle CPU, the update of the
rq's load is not done when the rq exit idle state because CFS's
functions are not called. Then, the idle_balance, which is
called just before entering the idle function, updates the rq's
load and makes the assumption that the elapsed time since the
last update, was only running time.

As a consequence, the rq's load of a CPU that only runs a
periodic RT task, is close to LOAD_AVG_MAX whatever the running
duration of the RT task is.

A new idle_exit function is called when the prev task is the
idle function so the elapsed time will be accounted as idle time
in the rq's load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: pjt@google.com
Cc: fweisbec@gmail.com
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 11:22:52 +02:00
Paul E. McKenney
c79aa0d965 events: Protect access via task_subsys_state_check()
The following RCU splat indicates lack of RCU protection:

[  953.267649] ===============================
[  953.267652] [ INFO: suspicious RCU usage. ]
[  953.267657] 3.9.0-0.rc6.git2.4.fc19.ppc64p7 #1 Not tainted
[  953.267661] -------------------------------
[  953.267664] include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!
[  953.267669]
[  953.267669] other info that might help us debug this:
[  953.267669]
[  953.267675]
[  953.267675] rcu_scheduler_active = 1, debug_locks = 0
[  953.267680] 1 lock held by glxgears/1289:
[  953.267683]  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<c00000000027f884>] .prepare_bprm_creds+0x34/0xa0
[  953.267700]
[  953.267700] stack backtrace:
[  953.267704] Call Trace:
[  953.267709] [c0000001f0d1b6e0] [c000000000016e30] .show_stack+0x130/0x200 (unreliable)
[  953.267717] [c0000001f0d1b7b0] [c0000000001267f8] .lockdep_rcu_suspicious+0x138/0x180
[  953.267724] [c0000001f0d1b840] [c0000000001d43a4] .perf_event_comm+0x4c4/0x690
[  953.267731] [c0000001f0d1b950] [c00000000027f6e4] .set_task_comm+0x84/0x1f0
[  953.267737] [c0000001f0d1b9f0] [c000000000280414] .setup_new_exec+0x94/0x220
[  953.267744] [c0000001f0d1ba70] [c0000000002f665c] .load_elf_binary+0x58c/0x19b0
...

This commit therefore adds the required RCU read-side critical
section to perf_event_comm().

Reported-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: a.p.zijlstra@chello.nl
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Link: http://lkml.kernel.org/r/20130419190124.GA8638@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Gustavo Luiz Duarte <gusld@br.ibm.com>
2013-04-21 11:21:39 +02:00
Ingo Molnar
73e21ce28d Merge branch 'perf/urgent' into perf/core
Conflicts:
	arch/x86/kernel/cpu/perf_event_intel.c

Merge in the latest fixes before applying new patches, resolve the conflict.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 10:57:33 +02:00
Linus Torvalds
830ac8524f Merge branch 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kdump fixes from Peter Anvin:
 "The kexec/kdump people have found several problems with the support
  for loading over 4 GiB that was introduced in this merge cycle.  This
  is partly due to a number of design problems inherent in the way the
  various pieces of kdump fit together (it is pretty horrifically manual
  in many places.)

  After a *lot* of iterations this is the patchset that was agreed upon,
  but of course it is now very late in the cycle.  However, because it
  changes both the syntax and semantics of the crashkernel option, it
  would be desirable to avoid a stable release with the broken
  interfaces."

I'm not happy with the timing, since originally the plan was to release
the final 3.9 tomorrow.  But apparently I'm doing an -rc8 instead...

* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kexec: use Crash kernel for Crash kernel low
  x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
  x86, kdump: Retore crashkernel= to allocate under 896M
  x86, kdump: Set crashkernel_low automatically
2013-04-20 18:40:36 -07:00
Sahara
4c69e6ea41 tracepoints: Prevent null probe from being added
Somehow tracepoint_entry_add_probe() function allows a null probe function.
And, this may lead to unexpected results since the number of probe
functions in an entry can be counted by checking whether a probe is null
or not in the for-loop.
This patch prevents a null probe from being added.
In tracepoint_entry_remove_probe() function, checking probe parameter
within the for-loop is moved out for code efficiency, leaving the null probe
feature which removes all probe functions in the entry.

Link: http://lkml.kernel.org/r/1365991995-19445-1-git-send-email-kpark3469@gmail.com

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-19 19:59:49 -04:00
Waiman Long
cc189d2513 mutex: Back out architecture specific check for negative mutex count
Linus suggested that probably all the supported architectures can
allow a negative mutex count without incorrect behavior, so we can
then back out the architecture specific change and allow the
mutex count to go to any negative number. That should further
reduce contention for non-x86 architecture.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-5-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:36 +02:00
Waiman Long
2bd2c92cf0 mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option
turned on) allow multiple tasks to spin on a single mutex
concurrently. A potential problem with the current approach is
that when the mutex becomes available, all the spinning tasks
will try to acquire the mutex more or less simultaneously. As a
result, there will be a lot of cacheline bouncing especially on
systems with a large number of CPUs.

This patch tries to reduce this kind of contention by putting
the mutex spinners into a queue so that only the first one in
the queue will try to acquire the mutex. This will reduce
contention and allow all the tasks to move forward faster.

The queuing of mutex spinners is done using an MCS lock based
implementation which will further reduce contention on the mutex
cacheline than a similar ticket spinlock based implementation.
This patch will add a new field into the mutex data structure
for holding the MCS lock. This expands the mutex size by 8 bytes
for 64-bit system and 4 bytes for 32-bit system. This overhead
will be avoid if the MUTEX_SPIN_ON_OWNER option is turned off.

The following table shows the jobs per minute (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the fserver
workloads to 1/2/4/8 nodes with hyperthreading off.

+-----------------+-----------+-----------+-------------+----------+
|  Configuration  | Mean JPM  | Mean JPM  |  Mean JPM   | % Change |
|                 | w/o patch | patch 1   | patches 1&2 |  1->1&2  |
+-----------------+------------------------------------------------+
|                 |              User Range 1100 - 2000            |
+-----------------+------------------------------------------------+
| 8 nodes, HT off |  227972   |  227237   |   305043    |  +34.2%  |
| 4 nodes, HT off |  393503   |  381558   |   394650    |   +3.4%  |
| 2 nodes, HT off |  334957   |  325240   |   338853    |   +4.2%  |
| 1 node , HT off |  198141   |  197972   |   198075    |   +0.1%  |
+-----------------+------------------------------------------------+
|                 |              User Range 200 - 1000             |
+-----------------+------------------------------------------------+
| 8 nodes, HT off |  282325   |  312870   |   332185    |   +6.2%  |
| 4 nodes, HT off |  390698   |  378279   |   393419    |   +4.0%  |
| 2 nodes, HT off |  336986   |  326543   |   340260    |   +4.2%  |
| 1 node , HT off |  197588   |  197622   |   197582    |    0.0%  |
+-----------------+-----------+-----------+-------------+----------+

At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.

The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable
change in performance.  Rather, there is a slight drop in
performance. This mutex spinning patch more than recovers the
lost performance and show a significant increase of +30% at high
user load with the full 8 nodes. Similar improvements were also
seen in a 3.8 kernel.

The table below shows the %time spent by different kernel
functions as reported by perf when running the fserver workload
at 1500 users with all 8 nodes.

+-----------------------+-----------+---------+-------------+
|        Function       |  % time   | % time  |   % time    |
|                       | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed    |  34.96%   | 34.91%  |   29.14%    |
| __write_lock_failed   |  10.14%   | 10.68%  |    7.51%    |
| mutex_spin_on_owner   |   3.62%   |  3.42%  |    2.33%    |
| mspin_lock            |    N/A    |   N/A   |    9.90%    |
| __mutex_lock_slowpath |   1.46%   |  0.81%  |    0.14%    |
| _raw_spin_lock        |   2.25%   |  2.50%  |    1.10%    |
+-----------------------+-----------+---------+-------------+

The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as
shown by the __mutex_lock_slowpath figure) which help a little
bit. We saw only a few percents improvement with that.

By applying patch 2 as well, the single mutex_spin_on_owner
figure is now split out into an additional mspin_lock figure.
The time increases from 3.42% to 11.23%. It shows a great
reduction in contention among the spinners leading to a 30%
improvement. The time ratio 9.9/2.33=4.3 indicates that there
are on average 4+ spinners waiting in the spin_lock loop for
each spinner in the mutex_spin_on_owner loop. Contention in
other locking functions also go down by quite a lot.

The table below shows the performance change of both patches 1 &
2 over patch 1 alone in other AIM7 workloads (at 8 nodes,
hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |      0.0%     |     -0.8%      |     +0.6%       |
| five_sec     |     -0.3%     |     +0.8%      |     +0.8%       |
| high_systime |     +0.4%     |     +2.4%      |     +2.1%       |
| new_fserver  |     +0.1%     |    +14.1%      |    +34.2%       |
| shared       |     -0.5%     |     -0.3%      |     -0.4%       |
| short        |     -1.7%     |     -9.8%      |     -8.3%       |
+--------------+---------------+----------------+-----------------+

The short workload is the only one that shows a decline in
performance probably due to the spinner locking and queuing
overhead.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-4-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:36 +02:00
Waiman Long
0dc8c730c9 mutex: Make more scalable by doing less atomic operations
In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a
total of three atomic read-modify-write instructions will be
issued in rapid succession. This can cause a lot of cache
bouncing when many tasks are trying to acquire the mutex at the
same time.

This patch will reduce the number of atomic_xchg instructions
used by checking the counter value first before issuing the
instruction. The atomic_read() function is just a simple memory
read. The atomic_xchg() function, on the other hand, can be up
to 2 order of magnitude or even more in cost when compared with
atomic_read(). By using atomic_read() to check the value first
before calling atomic_xchg(), we can avoid a lot of unnecessary
cache coherency traffic. The only downside with this change is
that a task on the slow path will have a tiny bit less chance of
getting the mutex when competing with another task in the fast
path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed
before calling atomic_cmpxchg().

The mutex locking and unlocking code for the x86 architecture
can allow any negative number to be used in the mutex count to
indicate that some tasks are waiting for the mutex. I am not so
sure if that is the case for the other architectures. So the
default is to avoid atomic_xchg() if the count has already been
set to -1. For x86, the check is modified to include all
negative numbers to cover a larger case.

The following table shows the jobs per minutes (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the
high_systime workloads to 1/2/4/8 nodes with hyperthreading on
and off.

+-----------------+-----------+------------+----------+
|  Configuration  | Mean JPM  |  Mean JPM  | % Change |
|		  | w/o patch | with patch |	      |
+-----------------+-----------------------------------+
|		  |      User Range 1100 - 2000	      |
+-----------------+-----------------------------------+
| 8 nodes, HT on  |    36980   |   148590  | +301.8%  |
| 8 nodes, HT off |    42799   |   145011  | +238.8%  |
| 4 nodes, HT on  |    61318   |   118445  |  +51.1%  |
| 4 nodes, HT off |   158481   |   158592  |   +0.1%  |
| 2 nodes, HT on  |   180602   |   173967  |   -3.7%  |
| 2 nodes, HT off |   198409   |   198073  |   -0.2%  |
| 1 node , HT on  |   149042   |   147671  |   -0.9%  |
| 1 node , HT off |   126036   |   126533  |   +0.4%  |
+-----------------+-----------------------------------+
|		  |       User Range 200 - 1000	      |
+-----------------+-----------------------------------+
| 8 nodes, HT on  |   41525    |   122349  | +194.6%  |
| 8 nodes, HT off |   49866    |   124032  | +148.7%  |
| 4 nodes, HT on  |   66409    |   106984  |  +61.1%  |
| 4 nodes, HT off |  119880    |   130508  |   +8.9%  |
| 2 nodes, HT on  |  138003    |   133948  |   -2.9%  |
| 2 nodes, HT off |  132792    |   131997  |   -0.6%  |
| 1 node , HT on  |  116593    |   115859  |   -0.6%  |
| 1 node , HT off |  104499    |   104597  |   +0.1%  |
+-----------------+------------+-----------+----------+

At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.

AIM7 benchmark run has a pretty large run-to-run variance due to
random nature of the subtests executed. So a difference of less
than +-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant
drop-off at high node count.  The patch has practically no
impact on 1 and 2 nodes system.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function
by the high_systime workload at 1500 users for 2/4/8-node
configurations with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
|    8 nodes    |      65.34%     |      0.69%       |  -99%   |
|    4 nodes    |       8.70%	  |      1.02%	     |  -88%   |
|    2 nodes    |       0.41%     |      0.32%       |  -22%   |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below show the improvements in other AIM7 workloads
(at 8 nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
| five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
| fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
| new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
| shared       |    +13.1%     |   +146.1%      |   +181.5%       |
| short        |     +7.4%     |     +5.0%      |     +4.2%       |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Norton: Scott J <scott.norton@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-3-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:35 +02:00
Waiman Long
41fcb9f230 mutex: Move mutex spinning code from sched/core.c back to mutex.c
As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
feature bit was really just an early hack to make with/without
mutex-spinning testable. So it is no longer necessary.

This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
move the mutex spinning code from kernel/sched/core.c back to
kernel/mutex.c which is where they should belong.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:34 +02:00
Li Zefan
712317ad97 cgroup: fix broken file xattrs
We should store file xattrs in struct cfent instead of struct cftype,
because cftype is a type while cfent is object instance of cftype.

For example each cgroup has a tasks file, and each tasks file is
associated with a uniq cfent, but all those files share the same
struct cftype.

Alexey Kodanev reported a crash, which can be reproduced:

  # mount -t cgroup -o xattr /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/test
  # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
  # rmdir /sys/fs/cgroup/test
  # umount /sys/fs/cgroup
  oops!

In this case, simple_xattrs_free() will free the same struct simple_xattrs
twice.

tj: Dropped unused local variable @cft from cgroup_diput().

Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-18 23:11:40 -07:00
Linus Torvalds
6835039d7e Merge branch 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux
Pull user-namespace fixes from Andy Lutomirski.

* 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
  userns: Changing any namespace id mappings should require privileges
  userns: Check uid_map's opener's fsuid, not the current fsuid
  userns: Don't let unprivileged users trick privileged users into setting the id_map
2013-04-18 18:09:12 -07:00
Linus Torvalds
0a82a8d132 Revert "block: add missing block_bio_complete() tracepoint"
This reverts commit 3a366e614d.

Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.

Jens says:
 "It's not quite clear why that is yet, so I think we should just revert
  the commit for 3.9 final (which I'm assuming is pretty close).

  The wifi is crap at the LSF hotel, so sending this email instead of
  queueing up a revert and pull request."

Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 09:00:26 -07:00
Masami Hiramatsu
5c51543b0a kprobes: Fix a double lock bug of kprobe_mutex
Fix a double locking bug caused when debug.kprobe-optimization=0.
While the proc_kprobes_optimization_handler locks kprobe_mutex,
wait_for_kprobe_optimizer locks it again and that causes a double lock.
To fix the bug, this introduces different mutex for protecting
sysctl parameter and locks it in proc_kprobes_optimization_handler.
Of course, since we need to lock kprobe_mutex when touching kprobes
resources, that is done in *optimize_all_kprobes().

This bug was introduced by commit ad72b3bea7 ("kprobes: fix
wait_for_kprobe_optimizer()")

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 08:58:38 -07:00
Thomas Gleixner
d2054b2c11 posix-timers: Remove unused variable
Remove the unused variable *node introduced by commit 5ed67f05 (posix
timers: Allocate timer id per process)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Pavel Emelyanov <xemul@parallels.com>
2013-04-18 12:51:19 +02:00
Emese Revfy
b9e146d8eb kernel/signal.c: stop info leak via the tkill and the tgkill syscalls
This fixes a kernel memory contents leak via the tkill and tgkill syscalls
for compat processes.

This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field
when handling signals delivered from tkill.

The place of the infoleak:

int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
{
        ...
        put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr);
        ...
}

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Reviewed-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-17 16:10:45 -07:00
Yinghai Lu
157752d84f kexec: use Crash kernel for Crash kernel low
We can extend kexec-tools to support multiple "Crash kernel" in /proc/iomem
instead.

So we can use "Crash kernel" instead of "Crash kernel low" in /proc/iomem.

Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-3-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:34 -07:00