Pull s390 updates from Martin Schwidefsky:
"There are two memory management related changes, the CMMA support for
KVM to avoid swap-in of freed pages and the split page table lock for
the PMD level. These two come with common code changes in mm/.
A fix for the long standing theoretical TLB flush problem, this one
comes with a common code change in kernel/sched/.
Another set of changes is Heikos uaccess work, included is the initial
set of patches with more to come.
And fixes and cleanups as usual"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (36 commits)
s390/con3270: optionally disable auto update
s390/mm: remove unecessary parameter from pgste_ipte_notify
s390/mm: remove unnecessary parameter from gmap_do_ipte_notify
s390/mm: fixing comment so that parameter name match
s390/smp: limit number of cpus in possible cpu mask
hypfs: Add clarification for "weight_min" attribute
s390: update defconfigs
s390/ptrace: add support for PTRACE_SINGLEBLOCK
s390/perf: make print_debug_cf() static
s390/topology: Remove call to update_cpu_masks()
s390/compat: remove compat exec domain
s390: select CONFIG_TTY for use of tty in unconditional keyboard driver
s390/appldata_os: fix cpu array size calculation
s390/checksum: remove memset() within csum_partial_copy_from_user()
s390/uaccess: remove copy_from_user_real()
s390/sclp_early: Return correct HSA block count also for zero
s390: add some drivers/subsystems to the MAINTAINERS file
s390: improve debug feature usage
s390/airq: add support for irq ranges
s390/mm: enable split page table lock for PMD level
...
I decided to run my tests on linux-next, and my wakeup_rt tracer was
broken. After running a bisect, I found that the problem commit was:
linux-next commit c365c292d0
"sched: Consider pi boosting in setscheduler()"
And the reason the wake_rt tracer test was failing, was because it had
no RT task to trace. I first noticed this when running with
sched_switch event and saw that my RT task still had normal SCHED_OTHER
priority. Looking at the problem commit, I found:
- p->normal_prio = normal_prio(p);
- p->prio = rt_mutex_getprio(p);
With no
+ p->normal_prio = normal_prio(p);
+ p->prio = rt_mutex_getprio(p);
Reading what the commit is suppose to do, I realize that the p->prio
can't be set if the task is boosted with a higher prio, but the
p->normal_prio still needs to be set regardless, otherwise, when the
task is deboosted, it wont get the new priority.
The p->prio has to be set before "check_class_changed()" is called,
otherwise the class wont be changed.
Also added fix to newprio to include a check for deadline policy that
was missing. This change was suggested by Juri Lelli.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: SebastianAndrzej Siewior <bigeasy@linutronix.de>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Deny the use of SCHED_DEADLINE policy to unprivileged users.
Even if root users can set the policy for normal users, we
don't want the latter to be able to change their parameters
(safest behavior).
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Michael spotted that the idle_balance() push down created a task
priority problem.
Previously, when we called idle_balance() before pick_next_task() it
wasn't a problem when -- because of the rq->lock droppage -- an rt/dl
task slipped in.
Similarly for pre_schedule(), rt pre-schedule could have a dl task
slip in.
But by pulling it into the pick_next_task() loop, we'll not try a
higher task priority again.
Cure this by creating a re-start condition in pick_next_task(); and
triggering this from pick_next_task_{rt,fair}().
It also fixes a live-lock where we get stuck in pick_next_task_fair()
due to idle_balance() seeing !0 nr_running but there not actually
being any fair tasks about.
Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Fixes: 38033c37fa ("sched: Push down pre_schedule() and idle_balance()")
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a leftover from commit e23ee74777
("sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list").
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/52F5CBF6.4060901@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a PI boosted task policy/priority is modified by a setscheduler()
call we unconditionally dequeue and requeue the task if it is on the
runqueue even if the new priority is lower than the current effective
boosted priority. This can result in undesired reordering of the
priority bucket list.
If the new priority is less or equal than the current effective we
just store the new parameters in the task struct and leave the
scheduler class and the runqueue untouched. This is handled when the
task deboosts itself. Only if the new priority is higher than the
effective boosted priority we apply the change immediately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Rebase ontop of v3.14-rc1. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following scenario does not work correctly:
Runqueue of CPUx contains two runnable and pinned tasks:
T1: SCHED_FIFO, prio 80
T2: SCHED_FIFO, prio 80
T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):
sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
...
sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
...
Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!
The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.
This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.
We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.
It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-6-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the policy and priority remain unchanged a possible modification of
p->sched_reset_on_fork gets lost in the early exit path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Rebase ontop of v3.14-rc1. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-5-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
might_sleep() can tell us where interrupts have been disabled, but we
have no idea what disabled preemption. Add some debug infrastructure.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-4-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Idle is not allowed to call sleeping functions ever!
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-3-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We stumbled in RT over a SMP bringup issue on ARM where the
idle->on_rq == 0 was causing try_to_wakeup() on the other cpu to run
into nada land.
After adding that idle->on_rq = 1; I was able to find the root cause
of the lockup: the idle task on the newly woken up cpu was fiddling
with a sleeping spinlock, which is a nono.
I kept the init of idle->on_rq to keep the state consistent and to
avoid another long lasting debug session.
As a side note, the whole debug mess could have been avoided if
might_sleep() would have yelled when called from the idle task. That's
fixed with patch 2/6 - and that one actually has a changelog :)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-2-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dan Carpenter reported:
> kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338)
> kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005)
Kirill also spotted that migrate_tasks() will have an instant NULL
deref because pick_next_task() will immediately deref prev.
Instead of fixing all the corner cases because migrate_tasks() can
pass in a NULL prev task in the unlikely case of hot-un-plug, provide
a fake task such that we can remove all the NULL checks from the far
more common paths.
A further problem; not previously spotted; is that because we pushed
pre_schedule() and idle_balance() into pick_next_task() we now need to
avoid those getting called and pulling more tasks on our dying CPU.
We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1.
We also note that since we call pick_next_task() exactly the amount of
times we have runnable tasks present, we should never land in
idle_balance().
Fixes: 38033c37fa ("sched: Push down pre_schedule() and idle_balance()")
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Kirill Tkhai <tkhai@yandex.ru>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Because of a recent syscall design debate; its deemed appropriate for
each syscall to have a flags argument for future extension; without
immediately requiring new syscalls.
Cc: juri.lelli@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We're copying the on-stack structure to userspace, but forgot to give
the right number of bytes to copy. This allows the calling process to
obtain up to PAGE_SIZE bytes from the stack (and possibly adjacent
kernel memory).
This fix copies only as much as we actually have on the stack
(attr->size defaults to the size of the struct) and leaves the rest of
the userspace-provided buffer untouched.
Found using kmemcheck + trinity.
Fixes: d50dde5a10 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392585857-10725-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix this lockdep warning:
[ 44.804600] =========================================================
[ 44.805746] [ INFO: possible irq lock inversion dependency detected ]
[ 44.805746] 3.14.0-rc2-test+ #14 Not tainted
[ 44.805746] ---------------------------------------------------------
[ 44.805746] bash/3674 just changed the state of lock:
[ 44.805746] (&dl_b->lock){+.....}, at: [<ffffffff8106ad15>] sched_rt_handler+0x132/0x248
[ 44.805746] but this lock was taken by another, HARDIRQ-safe lock in the past:
[ 44.805746] (&rq->lock){-.-.-.}
and interrupts could create inverse lock ordering between them.
[ 44.805746]
[ 44.805746] other info that might help us debug this:
[ 44.805746] Possible interrupt unsafe locking scenario:
[ 44.805746]
[ 44.805746] CPU0 CPU1
[ 44.805746] ---- ----
[ 44.805746] lock(&dl_b->lock);
[ 44.805746] local_irq_disable();
[ 44.805746] lock(&rq->lock);
[ 44.805746] lock(&dl_b->lock);
[ 44.805746] <Interrupt>
[ 44.805746] lock(&rq->lock);
by making dl_b->lock acquiring always IRQ safe.
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392107067-19907-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Don't compare sysctl_sched_rt_runtime against sysctl_sched_rt_period if
the former is equal to RUNTIME_INF, otherwise disabling -rt bandwidth
management (with CONFIG_RT_GROUP_SCHED=n) fails.
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392107067-19907-2-git-send-email-juri.lelli@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
While debugging the crash with the bad nr_running accounting, I hit
another bug where, after running my sched deadline test, I was getting
failures to take a CPU offline. It was giving me a -EBUSY error.
Adding a bunch of trace_printk()s around, I found that the cpu
notifier that called sched_cpu_inactive() was returning a failure. The
overflow value was coming up negative?
Talking this over with Juri, the problem is that the total_bw update was
suppose to be made by dl_overflow() which, during my tests, seemed to
not be called. Adding more trace_printk()s, it wasn't that it wasn't
called, but it exited out right away with the check of new_bw being
equal to p->dl.dl_bw. The new_bw calculates the ratio between period and
runtime. The bug is that if you set a deadline, you do not need to set
a period if you plan on the period being equal to the deadline. That
is, if period is zero and deadline is not, then the system call should
set the period to be equal to the deadline. This is done elsewhere in
the code.
The fix is easy, check if period is set, and if it is not, then use the
deadline.
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140219135335.7e74abd4@gandalf.local.home
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The finish_arch_post_lock_switch is called at the end of the task
switch after all locks have been released. In concept it is paired
with the switch_mm function, but the current code only does the
call in finish_task_switch. Add the call to idle_task_exit and
use_mm. One use case for the additional calls is s390 which will
use finish_arch_post_lock_switch to wait for the completion of
TLB flush operations.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost.
It's useful to know these values in debug mode.
Signed-off-by: Alex Shi <alex.shi@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/52E0F3BF.5020904@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch both merged idle_balance() and pre_schedule() and pushes
both of them into pick_next_task().
Conceptually pre_schedule() and idle_balance() are rather similar,
both are used to pull more work onto the current CPU.
We cannot however first move idle_balance() into pre_schedule_fair()
since there is no guarantee the last runnable task is a fair task, and
thus we would miss newidle balances.
Similarly, the dl and rt pre_schedule calls must be ran before
idle_balance() since their respective tasks have higher priority and
it would not do to delay their execution searching for less important
tasks first.
However, by noticing that pick_next_tasks() already traverses the
sched_class hierarchy in the right order, we can get the right
behaviour and do away with both calls.
We must however change the special case optimization to also require
that prev is of sched_class_fair, otherwise we can miss doing a dl or
rt pull where we needed one.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to avoid having to do put/set on a whole cgroup hierarchy
when we context switch, push the put into pick_next_task() so that
both operations are in the same function. Further changes then allow
us to possibly optimize away redundant work.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
idle_balance() modifies the rq->idle_stamp field, making this information
shared across core.c and fair.c.
As we know if the cpu is going to idle or not with the previous patch, let's
encapsulate the rq->idle_stamp information in core.c by moving it up to the
caller.
The idle_balance() function returns true in case a balancing occured and the
cpu won't be idle, false if no balance happened and the cpu is going idle.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: alex.shi@linaro.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As patch "sched: Move the priority specific bits into a new header file" exposes
the priority related macros in linux/sched/prio.h, we don't have to implement
task_nice() in kernel/sched/core.c any more.
This patch implements it in linux/sched/sched.h as static inline function,
saving the kernel stack and enhancing performance a bit.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: clark.williams@gmail.com
Cc: rostedt@goodmis.org
Cc: raistlin@linux.it
Cc: juri.lelli@gmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timer/dynticks updates from Ingo Molnar:
"This tree contains misc dynticks updates: a fix and three cleanups"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/nohz: Fix overflow error in scheduler_tick_max_deferment()
nohz_full: fix code style issue of tick_nohz_full_stop_tick
nohz: Get timekeeping max deferment outside jiffies_lock
tick: Rename tick_check_idle() to tick_irq_enter()
Tracing the code that decides the active nodes has made it abundantly clear
that the naive implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will access orders
of magnitudes more memory than the threads that do all the active work.
This resulted in the node with the garbage collector being marked the only
active node in the group.
This issue is avoided if we weigh the statistics by CPU use of each task in
the numa group, instead of by how many faults each thread has occurred.
To achieve this, we normalize the number of faults to the fraction of faults
that occurred on each node, and then multiply that fraction by the fraction
of CPU time the task has used since the last time task_numa_placement was
invoked.
This way the nodes in the active node mask will be the ones where the tasks
from the numa group are most actively running, and the influence of eg. the
garbage collector and other do-little threads is properly minimized.
On a 4 node system, using CPU use statistics calculated over a longer interval
results in about 1% fewer page migrations with two 32-warehouse specjbb runs
on a 4 node system, and about 5% fewer page migrations, as well as 1% better
throughput, with two 8-warehouse specjbb runs, as compared with the shorter
term statistics kept by the scheduler.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-7-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to get a more consistent naming scheme, making it clear
which fault statistics track memory locality, and which track
CPU locality, rename the memory fault statistics.
Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Not all classes implement (or can implement) a useful get_rr_interval()
function, default to a 0 time-slice for them.
This fixes a crash reported by Tommi Rantala.
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140127105413.GC11314@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a working sysctl to enable/disable automatic numa memory balancing
at runtime.
This allows us to track down performance problems with this feature and
is generally a good idea.
This was possible earlier through debugfs, but only with special
debugging options set. Also fix the boot message.
[akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge first patch-bomb from Andrew Morton:
- a couple of misc things
- inotify/fsnotify work from Jan
- ocfs2 updates (partial)
- about half of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
mm/migrate: remove unused function, fail_migrate_page()
mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
mm/migrate: correct failure handling if !hugepage_migration_support()
mm/migrate: add comment about permanent failure path
mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
mm: compaction: reset scanner positions immediately when they meet
mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
mm: compaction: detect when scanners meet in isolate_freepages
mm: compaction: reset cached scanner pfn's before reading them
mm: compaction: encapsulate defer reset logic
mm: compaction: trace compaction begin and end
memcg, oom: lock mem_cgroup_print_oom_info
sched: add tracepoints related to NUMA task migration
mm: numa: do not automatically migrate KSM pages
mm: numa: trace tasks that fail migration due to rate limiting
mm: numa: limit scope of lock for NUMA migrate rate limiting
mm: numa: make NUMA-migrate related functions static
lib/show_mem.c: show num_poisoned_pages when oom
mm/hwpoison: add '#' to hwpoison_inject
mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter
...
Pull cgroup updates from Tejun Heo:
"The bulk of changes are cleanups and preparations for the upcoming
kernfs conversion.
- cgroup_event mechanism which is and will be used only by memcg is
moved to memcg.
- pidlist handling is updated so that it can be served by seq_file.
Also, the list is not sorted if sane_behavior. cgroup
documentation explicitly states that the file is not sorted but it
has been for quite some time.
- All cgroup file handling now happens on top of seq_file. This is
to prepare for kernfs conversion. In addition, all operations are
restructured so that they map 1-1 to kernfs operations.
- Other cleanups and low-pri fixes"
* 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits)
cgroup: trivial style updates
cgroup: remove stray references to css_id
doc: cgroups: Fix typo in doc/cgroups
cgroup: fix fail path in cgroup_load_subsys()
cgroup: fix missing unlock on error in cgroup_load_subsys()
cgroup: remove for_each_root_subsys()
cgroup: implement for_each_css()
cgroup: factor out cgroup_subsys_state creation into create_css()
cgroup: combine css handling loops in cgroup_create()
cgroup: reorder operations in cgroup_create()
cgroup: make for_each_subsys() useable under cgroup_root_mutex
cgroup: css iterations and css_from_dir() are safe under cgroup_mutex
cgroup: unify pidlist and other file handling
cgroup: replace cftype->read_seq_string() with cftype->seq_show()
cgroup: attach cgroup_open_file to all cgroup files
cgroup: generalize cgroup_pidlist_open_file
cgroup: unify read path so that seq_file is always used
cgroup: unify cgroup_write_X64() and cgroup_write_string()
cgroup: remove cftype->read(), ->read_map() and ->write()
hugetlb_cgroup: convert away from cftype->read()
...
This patch adds three tracepoints
o trace_sched_move_numa when a task is moved to a node
o trace_sched_swap_numa when a task is swapped with another task
o trace_sched_stick_numa when a numa-related migration fails
The tracepoints allow the NUMA scheduler activity to be monitored and the
following high-level metrics can be calculated
o NUMA migrated stuck nr trace_sched_stick_numa
o NUMA migrated idle nr trace_sched_move_numa
o NUMA migrated swapped nr trace_sched_swap_numa
o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen)
o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid
Maybe a small number of these are acceptable
but a high number would be a major surprise.
It would be even worse if bounces are frequent.
o NUMA avg task migs. Average number of migrations for tasks
o NUMA stddev task mig Self-explanatory
o NUMA max task migs. Maximum number of migrations for a single task
In general the intent of the tracepoints is to help diagnose problems
where automatic NUMA balancing appears to be doing an excessive amount
of useless work.
[akpm@linux-foundation.org: remove semicolon-after-if, repair coding-style]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the introduction of sched_attr::sched_nice we need to check
if we've got permission to actually change the nice value.
Daniel found that can_nice() would always fail; and upon
inspection it turns out that can_nice() only tests to see if we
can lower the nice value, but it doesn't validate if we're
lowering or not.
Therefore amend the test to only call can_nice() when we lower
the nice value.
Reported-and-Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: raistlin@linux.it
Cc: juri.lelli@gmail.com
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Fixes: d50dde5a10 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
Link: http://lkml.kernel.org/r/20140116165425.GA9481@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I noticed the new sched_{set,get}attr() calls didn't properly deal
with the SCHED_RESET_ON_FORK hack.
Instead of propagating the flags in high bits nonsense use the brand
spanking new attr::sched_flags field.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Link: http://lkml.kernel.org/r/20140115162242.GJ31570@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fengguang Wu reported the following build warning:
> kernel/sched/core.c:3067 __sched_setscheduler() warn: unsigned 'attr->sched_priority' is never less than zero.
Since it doesn't make sense for attr::sched_priority to be negative,
remove the check, since we already test for an upper limit any actual
negative values passed in through the old param::sched_priority field
will still be detected.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Fixes: d50dde5a10 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
Link: http://lkml.kernel.org/n/tip-fid9nalzii2r5voxtf4eh5kz@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Wu reported LTP failures:
> ltp.sched_setparam02.1.TFAIL
> ltp.sched_setparam02.2.TFAIL
> ltp.sched_setparam02.3.TFAIL
> ltp.sched_setparam03.1.TFAIL
There were 2 things wrong; firstly __setscheduler() failed on
sched_setparam()'s policy = -1, fix that by reading from p->policy in
that case.
Secondly, getparam() (and getattr()) would still report !0
sched_priority for !FIFO/RR tasks after having been such. So
unconditionally set p->rt_priority.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Fixes: d50dde5a10 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
Link: http://lkml.kernel.org/r/20140115153320.GH31570@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Previously sched_setscheduler() and sched_setparam() would not affect
the nice value of a task, restore this behaviour.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: raistlin@linux.it
Cc: juri.lelli@gmail.com
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Fixes: d50dde5a10 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
Link: http://lkml.kernel.org/r/20140115113015.GB31570@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fengguang Wu's kbuild test robot reported the following new htmldocs warnings:
>>> Warning(kernel/sched/core.c:3380): No description found for parameter 'uattr'
>>> Warning(kernel/sched/core.c:3380): Excess function parameter 'attr' description in 'sys_sched_setattr'
>>> Warning(kernel/sched/core.c:3520): No description found for parameter 'uattr'
>>> Warning(kernel/sched/core.c:3520): Excess function parameter 'attr' description in 'sys_sched_getattr'
The second argument to sys_sched_{setattr,getattr}() is named uattr (not attr).
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Fixes: d50dde5a10 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
Link: http://lkml.kernel.org/r/52D5552D.5000102@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fix these new sparse warnings:
>> kernel/sched/core.c:305:14: sparse: symbol 'sysctl_sched_dl_period' was not declared. Should it be static?
>> kernel/sched/core.c:306:5: sparse: symbol 'sysctl_sched_dl_runtime' was not declared. Should it be static?
Better still, they're completely unused so remove them.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/n/tip-ke0shkG7vMnzmcdqhhiymyem@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While calculating the scheduler tick max deferment, the delta is
converted from microseconds to nanoseconds through a multiplication
against NSEC_PER_USEC.
But this microseconds operand is an unsigned int, thus the result may
likely overflow. The result is cast to u64 but only once the operation
is completed, which is too late to avoid overflown result.
This is currently not a problem because the scheduler tick max deferment
is 1 second. But this may become an issue as we plan to make this
value tunable.
So lets fix this by casting the usecs value to u64 before multiplying by
NSECS_PER_USEC.
Also to prevent from this kind of mistake to happen again, move this
ad-hoc jiffies -> nsecs conversion to a new helper.
Signed-off-by: Kevin Hilman <khilman@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Kevin Hilman <khilman@linaro.org>
Link: http://lkml.kernel.org/r/1387315388-31676-2-git-send-email-khilman@linaro.org
[move ad-hoc conversion to jiffies_to_nsecs helper]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
With various drivers wanting to inject idle time; we get people
calling idle routines outside of the idle loop proper.
Therefore we need to be extra careful about not missing
TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations.
While looking at this, I also realized there's a small window in the
existing idle loop where we can miss TIF_NEED_RESCHED; when it hits
right after the tif_need_resched() test at the end of the loop but
right before the need_resched() test at the start of the loop.
So move preempt_fold_need_resched() out of the loop where we're
guaranteed to have TIF_NEED_RESCHED set.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-x9jgh45oeayzajz2mjt0y7d6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current hotplug admission control is broken because:
CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task()
cannot fail and hard assumes it _will_ move all tasks off of the dying
cpu, failing this will break hotplug.
The much simpler solution is a DOWN_PREPARE handler that fails when
removing one CPU gets us below the total allocated bandwidth.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131220171343.GL2480@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the deadline specific sysctls for now. The problem with them is
that the interaction with the exisiting rt knobs is nearly impossible
to get right.
The current (as per before this patch) situation is that the rt and dl
bandwidth is completely separate and we enforce rt+dl < 100%. This is
undesirable because this means that the rt default of 95% leaves us
hardly any room, even though dl tasks are saver than rt tasks.
Another proposed solution was (a discarted patch) to have the dl
bandwidth be a fraction of the rt bandwidth. This is highly
confusing imo.
Furthermore neither proposal is consistent with the situation we
actually want; which is rt tasks ran from a dl server. In which case
the rt bandwidth is a direct subset of dl.
So whichever way we go, the introduction of dl controls at this point
is painful. Therefore remove them and instead share the rt budget.
This means that for now the rt knobs are used for dl admission control
and the dl runtime is accounted against the rt runtime. I realise that
this isn't entirely desirable either; but whatever we do we appear to
need to change the interface later, so better have a small interface
for now.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>