Some method to deal with rt-mutexes and make sched_dl interact with
the current PI code is needed. This raises all but trivial issues, which
need (according to us) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).
This is under development; in the meanwhile, as a temporary solution,
what this commit does is:
- ensure a pi-lock owner with waiters is never throttled down. Instead,
when it runs out of runtime, it immediately gets replenished and its
deadline is postponed;
- the scheduling parameters (relative deadline and default runtime)
used for those replenishments --during the whole period it holds the
pi-lock-- are the ones of the waiting task with earliest deadline.
Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.
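The boosting rule described above can be sketched in a few lines of C. This is a simplified user-space model written for illustration, not the kernel code: struct dl_params, struct dl_task and replenish_boosted() are hypothetical names, and time is kept in arbitrary units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified scheduling parameters of a -deadline task. */
struct dl_params {
    uint64_t runtime;   /* default runtime (budget) per instance */
    uint64_t deadline;  /* relative deadline */
};

/* Hypothetical, simplified -deadline task. */
struct dl_task {
    struct dl_params params;           /* the task's own parameters */
    uint64_t abs_deadline;             /* current absolute deadline */
    int64_t remaining;                 /* remaining runtime, <= 0 when exhausted */
    const struct dl_task *top_waiter;  /* earliest-deadline waiter, or NULL */
};

/*
 * Called when the task may have run out of runtime.
 * Returns 1 if the task may keep running, 0 if it has to be throttled.
 * A lock owner with waiters is replenished on the spot, using the
 * parameters of its earliest-deadline waiter (assumed runtime > 0).
 */
static inline int replenish_boosted(struct dl_task *t)
{
    const struct dl_params *p;

    if (t->remaining > 0)
        return 1;       /* budget left, nothing to do */
    if (t->top_waiter == NULL)
        return 0;       /* not boosted: regular throttling applies */

    p = &t->top_waiter->params;
    while (t->remaining <= 0) {
        t->abs_deadline += p->deadline;      /* postpone the deadline */
        t->remaining += (int64_t)p->runtime; /* refill the budget */
    }
    return 1;
}
```

With an owner that overran by 2 units and a waiter with runtime 10 and relative deadline 100, the owner's deadline moves forward by 100 and it is left with 8 units of budget.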
We would stress the fact that this is only a surely needed, all but
clean solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc. are welcome! :-)
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Added !RT_MUTEXES build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.
This is done mainly because:
- classical prio field of the plist is just an int, which might
not be enough for representing a deadline;
- manipulating such a list would become O(nr_deadline_tasks),
which might be too much, as the number of -deadline tasks increases.
Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
- among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
one with the higher (lower, actually!) prio wins;
- among a -priority and a -deadline task, the latter always wins;
- among two -deadline tasks, the one with the earliest deadline
wins.
Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.
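The ordering rules listed above can be sketched as a single "less than" function, which is what an rb-tree insertion needs. This is an illustrative model only: struct waiter and waiter_less() are hypothetical names, and the sketch assumes that every -deadline task carries one and the same negative prio value, so that it sorts before any -priority task.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical waiter record. Assumption: -deadline tasks all use prio -1. */
struct waiter {
    int prio;           /* lower value means higher priority */
    uint64_t deadline;  /* absolute deadline, meaningful only if prio < 0 */
};

static inline bool is_deadline(const struct waiter *w)
{
    return w->prio < 0;
}

/* Wraparound-safe test for "a is earlier than b". */
static inline bool deadline_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Returns true if 'l' must be queued before (to the left of) 'r'. */
static inline bool waiter_less(const struct waiter *l, const struct waiter *r)
{
    /* Two -deadline tasks: the earliest deadline wins. */
    if (is_deadline(l) && is_deadline(r))
        return deadline_before(l->deadline, r->deadline);

    /*
     * Otherwise compare prio values. Since -deadline tasks have a
     * negative prio here, a -deadline task always beats a -priority one.
     */
    return l->prio < r->prio;
}
```

The leftmost node of a tree ordered this way is the top waiter, so it can be looked up without walking a list.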
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-again-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is very likely that systems that want/need to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.
For this reason a new version of the scheduling wakeup latency,
called "wakeup_dl", is introduced.
As a consequence of applying this patch there will be three wakeup
latency tracers:
* "wakeup", that deals with all tasks in the system;
* "wakeup_rt", that deals with -rt and -deadline tasks only;
* "wakeup_dl", that deals with -deadline tasks only.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-9-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make it possible to specify a period (different from or equal to the
deadline) for -deadline tasks. Relative deadlines (D_i) are used on
task arrivals to generate new scheduling (absolute) deadlines as "d =
t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d
= d + P_i" when the budget is zero.
This is in general useful to model (and schedule) tasks that have slow
activation rates (long periods), but have to be scheduled soon once
activated (short deadlines).
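The two update rules above ("d = t + D_i" on arrival, "d = d + P_i" when the budget is zero) can be sketched as follows. This is a minimal model under stated assumptions: struct dl_entity and both helpers are hypothetical, time is in arbitrary units, and dl_runtime is assumed to be greater than zero.

```c
#include <stdint.h>

/* Hypothetical, simplified -deadline entity. */
struct dl_entity {
    uint64_t dl_runtime;   /* budget per instance (assumed > 0) */
    uint64_t dl_deadline;  /* relative deadline D_i */
    uint64_t dl_period;    /* period P_i */
    int64_t runtime;       /* remaining budget */
    uint64_t deadline;     /* absolute scheduling deadline d */
};

/* A new instance arrives at time t: d = t + D_i, with a full budget. */
static inline void dl_arrival(struct dl_entity *se, uint64_t t)
{
    se->deadline = t + se->dl_deadline;
    se->runtime = (int64_t)se->dl_runtime;
}

/* The budget is exhausted: postpone one period at a time, d = d + P_i. */
static inline void dl_postpone(struct dl_entity *se)
{
    while (se->runtime <= 0) {
        se->deadline += se->dl_period;
        se->runtime += (int64_t)se->dl_runtime;
    }
}
```

For a task with runtime 10, deadline 30 and period 100 that arrives at t = 1000, the first scheduling deadline is 1030; if it overruns by 3 units, the deadline becomes 1130 and 7 units of budget remain.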
Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-7-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make the core scheduler and load balancer aware of the load
produced by -deadline tasks, by updating the moving average
like for sched_rt.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-6-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate, when it is the case.
Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations at all.
The very same approach used in sched_rt is utilised:
- -deadline tasks are kept in CPU-specific runqueues,
- -deadline tasks are migrated among runqueues to achieve the
following:
* on an M-CPU system the M earliest deadline ready tasks
are always running;
* affinity/cpusets settings of all the -deadline tasks are
always respected.
Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.
To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU where to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU where to wake
up (and also where to push) a task is the one which is running the task
with the latest deadline among the M executing ones.
In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.
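The choice of a push (or wakeup) target described above can be sketched like this. It is a simplified illustration that uses only the per-CPU cached deadline and an affinity bitmask and ignores the system topology; NR_CPUS, struct cpu_state and find_later_cpu() are hypothetical names chosen for this example.

```c
#include <stdint.h>

#define NR_CPUS 8  /* hypothetical system size */

/* Hypothetical per-runqueue cache of the running task's deadline. */
struct cpu_state {
    int has_dl;              /* non-zero if a -deadline task is running */
    uint64_t curr_deadline;  /* its absolute deadline */
};

/*
 * Returns the best CPU for a task with absolute deadline 'task_deadline',
 * restricted to the CPUs set in 'allowed', or -1 if there is none.
 * A CPU without -deadline work is taken right away; otherwise the CPU
 * running the latest deadline wins, provided that deadline is later
 * than the task's own (so the task would actually preempt).
 */
static inline int find_later_cpu(const struct cpu_state cpus[NR_CPUS],
                                 uint32_t allowed, uint64_t task_deadline)
{
    int best = -1;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(allowed & (1u << cpu)))
            continue;   /* affinity forbids this CPU */
        if (!cpus[cpu].has_dl)
            return cpu; /* no -deadline task there */
        if ((int64_t)(cpus[cpu].curr_deadline - task_deadline) <= 0)
            continue;   /* runs an earlier or equal deadline */
        if (best < 0 ||
            (int64_t)(cpus[cpu].curr_deadline - cpus[best].curr_deadline) > 0)
            best = cpu; /* latest deadline seen so far */
    }
    return best;
}
```

Keeping the running deadline cached per runqueue is what makes this scan cheap: no remote runqueue has to be locked just to read a deadline.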
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
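The interplay between EDF selection and CBS enforcement described above can be sketched as follows. This is a toy model using a linear scan over an array, not the rb-tree based runqueue of the patch; struct edf_task, edf_pick() and cbs_account() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ready task. */
struct edf_task {
    uint64_t abs_deadline;  /* absolute deadline of the current instance */
    int64_t runtime;        /* remaining budget of the current instance */
    int throttled;          /* set once the budget is exhausted */
};

/* EDF: pick the non-throttled task with the smallest absolute deadline. */
static inline struct edf_task *edf_pick(struct edf_task *tasks, size_t n)
{
    struct edf_task *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tasks[i].throttled)
            continue;
        if (best == NULL ||
            (int64_t)(tasks[i].abs_deadline - best->abs_deadline) < 0)
            best = &tasks[i];
    }
    return best;
}

/*
 * CBS: charge 'delta' time units to the running task and throttle it
 * as soon as its budget is used up, so it cannot eat into the
 * bandwidth reserved for the other tasks.
 */
static inline void cbs_account(struct edf_task *t, uint64_t delta)
{
    t->runtime -= (int64_t)delta;
    if (t->runtime <= 0)
        t->throttled = 1;
}
```

Once the earliest-deadline task has consumed its budget it drops out of the selection, and the task with the next earliest deadline is picked, which is the bandwidth isolation property in miniature.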
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.
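As an illustration of how user space might fill in the new structure, here is a hedged sketch. It assumes the struct sched_attr layout and the sched_setattr(pid, attr, flags) form found in current mainline kernels, which may differ from the exact state of this patch. The structure is mirrored locally under the hypothetical name struct dl_sched_attr to avoid clashing with libc headers, and dl_attr_init(), dl_attr_valid() and dl_setattr() are helper names invented for this example. The validity check is a simplification of what the kernel enforces.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Local mirror of the kernel's struct sched_attr (assumed layout). */
struct dl_sched_attr {
    uint32_t size;            /* size of this structure */
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;      /* SCHED_OTHER, SCHED_BATCH */
    uint32_t sched_priority;  /* SCHED_FIFO, SCHED_RR */
    uint64_t sched_runtime;   /* SCHED_DEADLINE, in nanoseconds */
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Fill in a -deadline reservation: runtime every period, due by deadline. */
static inline void dl_attr_init(struct dl_sched_attr *a, uint64_t runtime_ns,
                                uint64_t deadline_ns, uint64_t period_ns)
{
    memset(a, 0, sizeof(*a));
    a->size = sizeof(*a);
    a->sched_policy = SCHED_DEADLINE;
    a->sched_runtime = runtime_ns;
    a->sched_deadline = deadline_ns;
    a->sched_period = period_ns;
}

/* Simplified sanity check: 0 < runtime <= deadline <= period. */
static inline int dl_attr_valid(const struct dl_sched_attr *a)
{
    return a->sched_runtime > 0 &&
           a->sched_runtime <= a->sched_deadline &&
           a->sched_deadline <= a->sched_period;
}

#ifdef SYS_sched_setattr
/* Thin wrapper around the raw syscall; pid 0 means the calling thread. */
static inline long dl_setattr(pid_t pid, const struct dl_sched_attr *a)
{
    return syscall(SYS_sched_setattr, pid, a, 0u);
}
#endif
```

A caller would typically run dl_attr_init(&a, 10000000, 30000000, 100000000) for 10ms of runtime every 100ms with a 30ms deadline, then dl_setattr(0, &a). The call usually needs elevated privileges, so its return value and errno should be checked.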
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Thomas Hellstrom bisected a regression where erratic 3D performance is
experienced on virtual machines as measured by glxgears. It identified
commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA
node") as the problem which had modified the behaviour of effective_load.
Effective load calculates the difference to the system-wide load if a
scheduling entity was moved to another CPU. The task group is not heavier
as a result of the move but overall system load can increase/decrease as a
result of the change. Commit 58d081b5 ("sched/numa: Avoid overloading CPUs
on a preferred NUMA node") changed effective_load to make it suitable for
calculating if a particular NUMA node was compute overloaded. To reduce
the cost of the function, it assumed that a current sched entity weight
of 0 was uninteresting but that is not the case.
wake_affine() uses a weight of 0 for sync wakeups on the grounds that it
is assuming the waking task will sleep and not contribute to load in the
near future. In this case, we still want to calculate the effective load
of the sched entity hierarchy. As effective_load is no longer used by
task_numa_compare since commit fb13c7ee (sched/numa: Use a system-wide
search to find swap/migration candidates), this patch simply restores the
historical behaviour.
Reported-and-tested-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
[ Wrote changelog]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140106113912.GC6178@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'pm+acpi-3.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes and new device IDs from Rafael Wysocki:
- Fix for a cpufreq regression causing stale sysfs files to be left
behind during system resume if cpufreq_add_dev() fails for one or
more CPUs from Viresh Kumar.
- Fix for a bug in cpufreq causing CONFIG_CPU_FREQ_DEFAULT_* to be
ignored when the intel_pstate driver is used from Jason Baron.
- System suspend fix for a memory leak in pm_vt_switch_unregister()
that forgot to release objects after removing them from
pm_vt_switch_list. From Masami Ichikawa.
- Intel Valley View device ID and energy unit encoding update for the
(recently added) Intel RAPL (Running Average Power Limit) driver from
Jacob Pan.
- Intel Bay Trail SoC GPIO and ACPI device IDs for the Low Power
Subsystem (LPSS) ACPI driver from Paul Drews.
* tag 'pm+acpi-3.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
powercap / RAPL: add support for ValleyView Soc
PM / sleep: Fix memory leak in pm_vt_switch_unregister().
cpufreq: Use CONFIG_CPU_FREQ_DEFAULT_* to set initial policy for setpolicy drivers
cpufreq: remove sysfs files for CPUs which failed to come back after resume
ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs
* pm-cpufreq:
cpufreq: Use CONFIG_CPU_FREQ_DEFAULT_* to set initial policy for setpolicy drivers
cpufreq: remove sysfs files for CPUs which failed to come back after resume
* pm-sleep:
PM / sleep: Fix memory leak in pm_vt_switch_unregister().
Pull cgroup fixes from Tejun Heo:
"Two fixes. One fixes a bug in the error path of cgroup_create(). The
other changes cgrp->id lifetime rule so that the id doesn't get
recycled before all controller states are destroyed. This premature
id recycling made memcg malfunction"
* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: don't recycle cgroup id until all csses' have been destroyed
cgroup: fix cgroup_create() error handling path
Pull libata fixes from Tejun Heo:
"There's one interesting commit - "libata, freezer: avoid block device
removal while system is frozen". It's an ugly hack working around a
deadlock condition between driver core resume and block layer device
removal paths through freezer which was made more reproducible by
writeback being converted to workqueue some releases ago. The bug has
nothing to do with libata but it's just a workaround which is easy to
backport. After discussion, Rafael and I seem to agree that we don't
really need kernel freezables - both kthread and workqueue. There are
few specific workqueues which constitute PM operations and require
freezing, which will be converted to use workqueue_set_max_active()
instead. All other kernel freezer uses are planned to be removed,
followed by the removal of kthread and workqueue freezer support,
hopefully.
Others are device-specific fixes. The most notable is the addition of
NO_NCQ_TRIM which is used to disable queued TRIM commands to Micron
M500 SSDs which otherwise suffer data corruption"
* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
libata, freezer: avoid block device removal while system is frozen
libata: implement ATA_HORKAGE_NO_NCQ_TRIM and apply it to Micro M500 SSDs
libata: disable a disk via libata.force params
ahci: bail out on ICH6 before using AHCI BAR
ahci: imx: Explicitly clear IMX6Q_GPR13_SATA_MPLL_CLK_EN
libata: add ATA_HORKAGE_BROKEN_FPDMA_AA quirk for Seagate Momentus SpinPoint M8
In struct page we have enough space to fit long-size page->ptl there,
but we use dynamically-allocated page->ptl if sizeof(spinlock_t) is larger
than sizeof(int).
It hurts 64-bit architectures with CONFIG_GENERIC_LOCKBREAK, where
sizeof(spinlock_t) == 8, but it easily fits into struct page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"This fixes a long standing bug in the ftrace profiler. The problem is
that the profiler only initializes the online CPUs, and not possible
CPUs. This causes issues if the user takes CPUs online or offline
while the profiler is running.
If we online a CPU after starting the profiler, we lose all the trace
information on the CPU going online.
If we offline a CPU after running a test and start a new test, it will
not clear the old data from that CPU.
This bug causes incorrect data to be reported to the user if they
online or offline CPUs during the profiling"
* tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Initialize the ftrace profiler for each possible cpu
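The online versus possible distinction behind this fix can be illustrated with a small model. Everything here is hypothetical (NR_CPUS, struct profile_stat, profile_reset(), and the use of plain bitmasks for CPU sets); it only shows why resetting per-CPU data for the online set alone leaves stale or uninitialized entries behind.

```c
#include <stdint.h>

#define NR_CPUS 8  /* hypothetical upper bound on CPUs */

/* Hypothetical per-CPU profiling state. */
struct profile_stat {
    uint64_t hits;    /* collected samples */
    int initialized;  /* non-zero once the per-CPU buffers exist */
};

static struct profile_stat stats[NR_CPUS];

/* (Re)initialize the profiling state of every CPU set in 'mask'. */
static inline void profile_reset(uint32_t mask)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(mask & (1u << cpu)))
            continue;
        stats[cpu].hits = 0;
        stats[cpu].initialized = 1;
    }
}
```

If CPU 2 is possible but currently offline, profile_reset(online_mask) leaves its old hits in place and never sets it up, while profile_reset(possible_mask) covers it regardless of its hotplug state.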
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.
During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.
839a8e8660 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.
I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(
v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.
v3: Comment updated and polling interval changed to 10ms as suggested
by Rafael.
v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
defined when FREEZER is not configured thus breaking build.
Reported by kbuild test robot.
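The shape of the workaround (defer the hot-unplug and poll every 10ms until the system is no longer frozen) can be sketched in portable C. This is not the libata code: the frozen check and the sleep are passed in as callbacks so the sketch has no platform dependencies, and struct hotplug_env, wait_until_thawed() and the fake_* helpers are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define FREEZE_POLL_MS 10  /* polling interval mentioned in v3 */

/* Hypothetical environment: how to test for "frozen" and how to sleep. */
struct hotplug_env {
    bool (*system_frozen)(void *ctx);             /* stands in for pm_freezing */
    void (*sleep_ms)(void *ctx, unsigned int ms); /* stands in for a sleep call */
    void *ctx;
};

/*
 * Block until the system is thawed, polling at a fixed interval.
 * Returns how many milliseconds the hot-unplug was deferred.
 */
static inline unsigned int wait_until_thawed(const struct hotplug_env *env)
{
    unsigned int waited = 0;

    while (env->system_frozen(env->ctx)) {
        env->sleep_ms(env->ctx, FREEZE_POLL_MS);
        waited += FREEZE_POLL_MS;
    }
    return waited;
}

/* Demo stand-ins: report "frozen" for the first few checks only. */
struct fake_pm {
    int frozen_checks_left;
    unsigned int slept_ms;
};

static inline bool fake_frozen(void *ctx)
{
    struct fake_pm *pm = ctx;

    if (pm->frozen_checks_left > 0) {
        pm->frozen_checks_left--;
        return true;
    }
    return false;
}

static inline void fake_sleep(void *ctx, unsigned int ms)
{
    struct fake_pm *pm = ctx;

    pm->slept_ms += ms;  /* record the delay instead of really sleeping */
}
```

Polling is crude, but it keeps the block device removal out of the window in which the kthreads and workqueues it depends on cannot make progress.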
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Cc: kbuild test robot <fengguang.wu@intel.com>
Pull perf fixes from Ingo Molnar:
"An ABI documentation fix, and a mixed-PMU perf-info-corruption fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Document the new transaction sample type
perf: Disable all pmus on unthrottling and rescheduling
Merge patches from Andrew Morton:
"23 fixes and a MAINTAINERS update"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits)
mm/hugetlb: check for pte NULL pointer in __page_check_address()
fix build with make 3.80
mm/mempolicy: fix !vma in new_vma_page()
MAINTAINERS: add Davidlohr as GPT maintainer
mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
mm/compaction: respect ignore_skip_hint in update_pageblock_skip
mm/mempolicy: correct putback method for isolate pages if failed
mm: add missing dependency in Kconfig
sh: always link in helper functions extracted from libgcc
mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
mm: numa: defer TLB flush for THP migration as long as possible
mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates
mm: fix TLB flush race between migration, and change_protection_range
mm: numa: avoid unnecessary disruption of NUMA hinting during migration
mm: numa: clear numa hinting information on mprotect
sched: numa: skip inaccessible VMAs
mm: numa: avoid unnecessary work on the failure path
mm: numa: ensure anon_vma is locked to prevent parallel THP splits
mm: numa: do not clear PTE for pte_numa update
mm: numa: do not clear PMD during PTE update scan
...
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A CPU B CPU C
load TLB entry
make entry PTE/PMD_NUMA
fault on entry
read/write old page
start migrating page
change PTE/PMD to new page
read/write old page [*]
flush TLB
reload TLB from new entry
read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
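The idea of the fix can be modelled in a few lines. The flag values, struct mm_sketch and pte_accessible_sketch() below are hypothetical stand-ins, not the x86 definitions; the sketch only encodes the rule that a non-present PROT_NONE or NUMA entry still counts as accessible while a TLB flush is pending for the mm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE flag bits, chosen for this sketch only. */
#define PTE_PRESENT  (1u << 0)
#define PTE_PROTNONE (1u << 1)
#define PTE_NUMA     (1u << 2)

/* Hypothetical stand-in for the per-mm "TLB flush pending" state. */
struct mm_sketch {
    bool tlb_flush_pending;
};

/*
 * May some CPU still reach the page through this entry?
 * - present entries: yes;
 * - PROT_NONE/NUMA entries: yes while a TLB flush is pending, because
 *   a stale cached translation may still be in use somewhere;
 * - anything else: no.
 */
static inline bool pte_accessible_sketch(const struct mm_sketch *mm,
                                         uint32_t pte_flags)
{
    if (pte_flags & PTE_PRESENT)
        return true;
    if ((pte_flags & (PTE_PROTNONE | PTE_NUMA)) && mm->tlb_flush_pending)
        return true;
    return false;
}
```

Code that migrates a page consults this kind of check to decide whether remote TLBs must be flushed first, which closes the window marked [*] in the diagram above.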
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inaccessible VMA should not be trapping NUMA hint faults. Skip them.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1b3a5d02ee ("reboot: move arch/x86 reboot= handling to generic
kernel") moved reboot= handling to generic code. In the process it also
removed the code in native_machine_shutdown() which was moving the reboot
process to reboot_cpu/cpu0.
I guess the thought must have been that all reboot paths are calling
migrate_to_reboot_cpu(), so we don't need this special handling. But the
kexec reboot path (kernel_kexec()) is not calling
migrate_to_reboot_cpu(), so the above change broke kexec. Now reboot can
happen on a non-boot cpu and when INIT is sent in the second kernel to bring
up the BP, it brings down the machine.
So start calling migrate_to_reboot_cpu() in kexec reboot path to avoid
this problem.
Bisected by WANG Chao.
Reported-by: Matthew Whitehead <mwhitehe@redhat.com>
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Tested-by: WANG Chao <chaowang@redhat.com>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull crypto key patches from David Howells:
"There are four items:
- A patch to fix X.509 certificate gathering. The problem was that I
was coming up with a different path for signing_key.x509 in the
build directory if it didn't exist to if it did exist. This meant
that the X.509 cert container object file would be rebuilt on the
second rebuild in a build directory and the kernel would get
relinked.
- Unconditionally remove files generated by SYSTEM_TRUSTED_KEYRING=y
when doing make mrproper.
- Actually initialise the persistent-keyring semaphore for
init_user_ns. I have no idea why this works at all for users in
the base user namespace unless it's something to do with systemd
containerising the system.
- Documentation for module signing"
* 'keys-devel' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
Add Documentation/module-signing.txt file
KEYS: fix uninitialized persistent_keyring_register_sem
KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y
X.509: Fix certificate gathering
Pull scheduler fixes from Ingo Molnar:
"Three fixes for scheduler crashes, each triggers in relatively rare,
hardware environment dependent situations"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Rework sched_fair time accounting
math64: Add mul_u64_u32_shr()
sched: Remove PREEMPT_NEED_RESCHED from generic code
sched: Initialize power_orig for overlapping groups
The original code is as intended and was meant to scale the difference
between the NUMA_PERIOD_THRESHOLD and local/remote ratio when adjusting
the scan period. The period_slot recalculation can be dropped.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1386833006-6600-4-git-send-email-liwanp@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use wrapper function task_faults_idx to calculate index in group_faults.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1386833006-6600-3-git-send-email-liwanp@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use wrapper function task_node to get node which task is on.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1386833006-6600-2-git-send-email-liwanp@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 887c290e (sched/numa: Decide whether to favour task or group weights
based on swap candidate relationships) dropped the check against
sysctl_numa_balancing_settle_count; this patch removes the sysctl.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch touches the RT group scheduling case.
Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's
priority, while the rt_rq passed to them may not be the top-level rt_rq.
This is wrong, because a change of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.
The short example: the task having the highest priority among all rq's
RT tasks (no other task has the same priority) is waking on a
throttled rt_rq. The rq's cpupri is set to the task's priority
equivalent, but the real rq->rt.highest_prio.curr is less.
The patch below fixes the problem.
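The idea can be sketched in a few lines of C. This is only an illustration under simplifying assumptions, not the patch itself: the types `rq_sketch` and `rt_rq_sketch`, their fields, and `inc_rt_prio_sketch()` are hypothetical stand-ins for the kernel structures, and lower numbers mean higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's rq / rt_rq. */
struct rt_rq_sketch;

struct rq_sketch {
	struct rt_rq_sketch *top_rt;	/* top-level rt_rq of this rq */
	int cpupri;			/* priority advertised for RT balancing */
};

struct rt_rq_sketch {
	struct rq_sketch *rq;
	int highest_prio;		/* lower value = higher priority */
};

/*
 * Called when a task of priority 'prio' is enqueued on 'rt_rq'.
 * Only the top-level rt_rq may change the rq-wide priority: a child
 * (group) rt_rq can be throttled, so its tasks may not be runnable.
 */
static void inc_rt_prio_sketch(struct rt_rq_sketch *rt_rq, int prio)
{
	assert(rt_rq != NULL && rt_rq->rq != NULL);

	if (prio < rt_rq->highest_prio)
		rt_rq->highest_prio = prio;

	if (rt_rq->rq->top_rt != rt_rq)
		return;			/* child level: leave cpupri alone */

	if (prio < rt_rq->rq->cpupri)
		rt_rq->rq->cpupri = prio;
}
```

With this guard, a wakeup on a child rt_rq still updates that child's own highest priority, but the rq-wide value used for balancing only follows the top-level rt_rq.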
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
dereference on sd_busy but the fix also altered what scheduling domain it
used for the 'sd_llc' percpu variable.
One impact of this is that a task selecting a runqueue may consider
idle CPUs that are not cache siblings as candidates for running.
Tasks are then running on CPUs that are not cache hot.
This was found through bisection where ebizzy threads were not seeing equal
performance and it looked like a scheduling fairness issue. This patch
mitigates but does not completely fix the problem on all machines tested
implying there may be an additional bug or a common root cause. Here is
the average range of performance seen by individual ebizzy threads. It
was tested on top of candidate patches related to x86 TLB range flushing.
4-core machine
                 3.13.0-rc3           3.13.0-rc3
                    vanilla           fixsd-v3r3
Mean 1      0.00 (   0.00%)      0.00 (   0.00%)
Mean 2      0.34 (   0.00%)      0.10 (  70.59%)
Mean 3      1.29 (   0.00%)      0.93 (  27.91%)
Mean 4      7.08 (   0.00%)      0.77 (  89.12%)
Mean 5    193.54 (   0.00%)      2.14 (  98.89%)
Mean 6    151.12 (   0.00%)      2.06 (  98.64%)
Mean 7    115.38 (   0.00%)      2.04 (  98.23%)
Mean 8    108.65 (   0.00%)      1.92 (  98.23%)
8-core machine
Mean 1      0.00 (   0.00%)      0.00 (   0.00%)
Mean 2      0.40 (   0.00%)      0.21 (  47.50%)
Mean 3     23.73 (   0.00%)      0.89 (  96.25%)
Mean 4     12.79 (   0.00%)      1.04 (  91.87%)
Mean 5     13.08 (   0.00%)      2.42 (  81.50%)
Mean 6     23.21 (   0.00%)     69.46 (-199.27%)
Mean 7     15.85 (   0.00%)    101.72 (-541.77%)
Mean 8    109.37 (   0.00%)     19.13 (  82.51%)
Mean 12   124.84 (   0.00%)     28.62 (  77.07%)
Mean 16   113.50 (   0.00%)     24.16 (  78.71%)
It's eliminated for one machine and reduced for another.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: H Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, only one PMU in a context gets disabled during unthrottling
and event_sched_{out,in}(). However, events in one context may belong to
different PMUs, which results in PMUs being reprogrammed while they are
still enabled.
This means that mixed PMU use [which is rare in itself] could produce
completely unreliable results: corrupted events, bogus results, etc.
This patch temporarily disables PMUs that correspond to
each event in the context while these events are being modified.
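A rough sketch of that pattern follows. It assumes hypothetical, simplified types (`pmu_sketch`, `event_sketch`) and helper names that are not the kernel's perf API; it only shows the PMU owning each event being disabled around that event's update.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's perf structures. */
struct pmu_sketch {
	int disable_count;	/* > 0 means the PMU is currently disabled */
	int unsafe_writes;	/* times it was reprogrammed while enabled */
};

struct event_sketch {
	struct pmu_sketch *pmu;	/* each event may belong to a different PMU */
	int period;
};

static void pmu_disable_sketch(struct pmu_sketch *pmu) { pmu->disable_count++; }
static void pmu_enable_sketch(struct pmu_sketch *pmu)  { pmu->disable_count--; }

/* Reprogram one event; record a violation if its PMU is still enabled. */
static void event_reprogram_sketch(struct event_sketch *ev, int period)
{
	if (ev->pmu->disable_count == 0)
		ev->pmu->unsafe_writes++;
	ev->period = period;
}

/*
 * Modify every event in a context, disabling the PMU that owns each
 * event for the duration of that event's update.
 */
static void ctx_update_events_sketch(struct event_sketch *evs, size_t n,
				     int period)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(evs[i].pmu != NULL);
		pmu_disable_sketch(evs[i].pmu);
		event_reprogram_sketch(&evs[i], period);
		pmu_enable_sketch(evs[i].pmu);
	}
}
```

Using a disable counter rather than a boolean lets the per-event disable nest inside an outer disable of the same PMU.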
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: http://lkml.kernel.org/r/1387196256-8030-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Hugh reported this bug:
> CONFIG_MEMCG_SWAP is broken in 3.13-rc. Try something like this:
>
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1 # let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
>
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> res_counter_uncharge_locked+0x1f/0x2f()
>
> Breakage comes from 34c00c319c ("memcg: convert to use cgroup id").
>
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.
Instead of removing cgroup id right after all the csses have been
offlined, we should do that after csses have been destroyed.
To make sure an invalid css pointer won't be returned after the css
is destroyed, make sure css_from_id() returns NULL in this case.
tj: Updated comment to note planned changes for cgrp->id.
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Ftrace currently initializes only the online CPUs. This implementation has
two problems:
- If we online a CPU after we enable the function profile, and then run the
test, we will lose the trace information on that CPU.
Steps to reproduce:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# echo 1 > /sys/devices/system/cpu/cpu1/online
# run test
- If we offline a CPU before we enable the function profile, we will not clear
the trace information when we enable the function profile, so users will
see stale data for that CPU.
Steps to reproduce:
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# run test
# cat trace_stat/function*
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > function_profile_enabled
# echo 1 > function_profile_enabled
# cat trace_stat/function*
# run test
# cat trace_stat/function*
So it is better that we initialize the ftrace profiler for each possible cpu
every time we enable the function profile instead of just the online ones.
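The difference between the two masks can be illustrated with a small sketch. The per-CPU structure, the bitmask representation (bit n = CPU n) and `profile_init_sketch()` are hypothetical simplifications, not the ftrace profiler code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical per-CPU profiler state; not the kernel's ftrace structures. */
struct profile_stat_sketch {
	bool initialized;
	unsigned long hits;
};

static struct profile_stat_sketch stats[NR_CPUS_SKETCH];

/*
 * Reset the profiler state for every CPU set in 'mask' (bit n = CPU n).
 * Passing the mask of possible CPUs, rather than the online ones, covers
 * CPUs that come online later and clears stale data of offline CPUs.
 */
static void profile_init_sketch(unsigned int mask)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (!(mask & (1u << cpu)))
			continue;
		stats[cpu].initialized = true;
		stats[cpu].hits = 0;
	}
}
```

Initializing with only the online mask leaves an offline CPU's old counters untouched, which matches the second reproduction case above.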
Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com
Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix the gathering of certificates from both the source tree and the build tree
to correctly calculate the pathnames of all the certificates.
The problem was that if the default generated cert, signing_key.x509, didn't
exist then it would not have a path attached and if it did, it would have a
path attached.
This means that the contents of kernel/.x509.list would change between the
first compilation in a directory and the second. After the second it would
remain stable because the signing_key.x509 file exists.
The consequence was that the kernel would get relinked unconditionally on the
second recompilation. The second recompilation would also show something like
this:
X.509 certificate list changed
CERTS kernel/x509_certificate_list
- Including cert /home/torvalds/v2.6/linux/signing_key.x509
AS kernel/system_certificates.o
LD kernel/built-in.o
which is why the relink would happen.
Unfortunately, it isn't a simple matter of just sticking a path on the front
of the filename of the certificate in the build directory as make can't then
work out how to build it.
So the path has to be prepended to the name for sorting and duplicate
elimination and then removed for the make rule if it is in the build tree.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Merge tag 'keys-devel-20131210' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull misc keyrings fixes from David Howells:
"These break down into five sets:
- A patch to error handling in the big_key type for huge payloads.
If the payload is larger than the "low limit" and the backing store
allocation fails, then big_key_instantiate() doesn't clear the
payload pointers in the key, assuming them to have been previously
cleared - but only one of them is.
Unfortunately, the garbage collector still calls big_key_destroy()
when it sees one of the pointers with a weird value in it (and not
NULL), which it then tries to clean up.
- Three patches to fix the keyring type:
* A patch to fix the hash function to correctly divide keyrings off
from keys in the topology of the tree inside the associative
array. This is only a problem if searching through nested
keyrings - and only if the hash function incorrectly puts a
keyring outside of the 0 branch of the root node.
* A patch to fix keyrings' use of the associative array. The
__key_link_begin() function initially passes a NULL key pointer
to assoc_array_insert() on the basis that it's holding a place in
the tree whilst it does more allocation and stuff.
This is only a problem when a node contains 16 keys that match at
that level and we want to add an also matching 17th. This should
easily be manufactured with a keyring full of keyrings (without
chucking any other sort of key into the mix) - except for (a)
above which makes it on average adding the 65th keyring.
* A patch to fix searching down through nested keyrings, where any
keyring in the set has more than 16 keyrings and none of the
first keyrings we look through has a match (before the tree
iteration needs to step to a more distal node).
Test in keyutils test suite:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/keyutils.git/commit/?id=8b4ae963ed92523aea18dfbb8cab3f4979e13bd1
- A patch to fix the big_key type's use of a shmem file as its
backing store causing audit messages and LSM check failures. This
is done by setting S_PRIVATE on the file to avoid LSM checks on the
file (access to the shmem file goes through the keyctl() interface
and so is gated by the LSM that way).
This isn't normally a problem if a key is used by the context that
generated it - and it's currently only used by libkrb5.
Test in keyutils test suite:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/keyutils.git/commit/?id=d9a53cbab42c293962f2f78f7190253fc73bd32e
- A patch to add a generated file to .gitignore.
- A patch to fix the alignment of the system certificate data such
that it works on s390. As I understand it, on the s390 arch,
symbols must be 2-byte aligned because loading the address discards
the least-significant bit"
* tag 'keys-devel-20131210' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
KEYS: correct alignment of system_certificate_list content in assembly file
Ignore generated file kernel/x509_certificate_list
security: shmem: implement kernel private shmem inodes
KEYS: Fix searching of nested keyrings
KEYS: Fix multiple key add into associative array
KEYS: Fix the keyring hash function
KEYS: Pre-clear struct key on allocation
When debugging the read-only hugepage case, I was confused by the fact
that get_futex_key() did an access_ok() only for the non-shared futex
case, since the user address checking really isn't in any way specific
to the private key handling.
Now, it turns out that the shared key handling does effectively do the
equivalent checks inside get_user_pages_fast() (it doesn't actually
check the address range on x86, but does check the page protections for
being a user page). So it wasn't actually a bug, but the fact that we
treat the address differently for private and shared futexes threw me
for a loop.
Just move the check up, so that it gets done for both cases. Also, use
the 'rw' parameter for the type, even if it doesn't actually matter any
more (it's a historical artifact of the old racy i386 "page faults from
kernel space don't check write protections").
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The hugepage code had the exact same bug that regular pages had in
commit 7485d0d375 ("futexes: Remove rw parameter from
get_futex_key()").
The regular page case was fixed by commit 9ea71503a8 ("futex: Fix
regression with read only mappings"), but the transparent hugepage case
(added in a5b338f2b0: "thp: update futex compound knowledge")
remained broken.
Found by Dave Jones and his trinity tool.
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Cc: stable@kernel.org # v2.6.38+
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
results in him occasionally seeing time going backwards - which
crashes the scheduler ...
Most of our time accounting can actually handle that except the most
common one; the tick time update of sched_fair.
There is a further problem with that code; previously we assumed that
because we get a tick every TICK_NSEC our time delta could never
exceed 32bits and math was simpler.
However, ever since Frederic managed to get NO_HZ_FULL merged, this is
no longer the case, since a task can now run for a long time indeed
without getting a tick. It only takes about 4.3 seconds (2^32 ns) to
overflow our u32 in nanoseconds.
This means we not only need to deal better with time going backwards,
but also need to be able to deal with large deltas.
This patch reworks the entire code and uses mul_u64_u32_shr() as
proposed by Andy a long while ago.
We express our virtual time scale factor in a u32 multiplier and shift
right and the 32bit mul_u64_u32_shr() implementation reduces to a
single 32x32->64 multiply if the time delta is still short (common
case).
For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
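The 32-bit reduction described above can be sketched as follows. This is an illustrative version written with `<stdint.h>` types and a `_sketch` suffix, assuming shift <= 32; the kernel's actual helper may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds after which a 32-bit delta wraps: 2^32 ns, about 4.29 s. */
#define U32_OVERFLOW_NS (UINT64_C(1) << 32)

/*
 * Compute (a * mul) >> shift without a 128-bit intermediate, for
 * shift <= 32. The 64-bit operand is split into 32-bit halves; when the
 * high half is zero (short delta) a single 32x32->64 multiply suffices.
 * The result is exact as long as it fits in 64 bits.
 */
static uint64_t mul_u64_u32_shr_sketch(uint64_t a, uint32_t mul,
				       unsigned int shift)
{
	uint32_t al = (uint32_t)a;
	uint32_t ah = (uint32_t)(a >> 32);
	uint64_t ret;

	assert(shift <= 32);
	ret = ((uint64_t)al * mul) >> shift;
	if (ah)
		ret += ((uint64_t)ah * mul) << (32 - shift);
	return ret;
}
```

For example, scaling a delta by one half is expressed as a multiplier of 2^31 with a shift of 32, and the same call works for deltas longer than 2^32 ns.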
Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: fweisbec@gmail.com
Cc: Paul Turner <pjt@google.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yinghai reported that he saw a division by zero (/0) in sg_capacity on his EX parts.
Make sure to always initialize power_orig now that we actually use it.
Ideally build_sched_domains() -> init_sched_groups_power() would also
initialize this; but for some yet unexplained reason some setups seem
to miss updates there.
Reported-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Apart from data-type specific alignment constraints, there are also
architecture-specific alignment requirements.
For example, on s390 symbols must be on even addresses implying a 2-byte
alignment. If the system_certificate_list_end symbol is on an odd address
and if this address is loaded, the least-significant bit is ignored. As a
result, the load_system_certificate_list() fails to load the certificates
because of a wrong certificate length calculation.
To be safe, align system_certificate_list on an 8-byte boundary. Also improve
the length calculation of the system_certificate_list content. Introduce a
system_certificate_list_size (8-byte aligned because of unsigned long) variable
that stores the length. Let the linker calculate this size by introducing
a start and end label for the certificate content.
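The length miscalculation can be modelled in plain C. The helpers below are hypothetical and only mimic the behaviour described above (an address load that ignores the least-significant bit); they are not the s390 or certificate-loading code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of an address load that ignores the least-significant bit, as
 * described above for symbol addresses on s390.
 */
static uintptr_t load_symbol_addr_sketch(uintptr_t addr)
{
	return addr & ~(uintptr_t)1;
}

/* Length as computed from two loaded symbol addresses. */
static uintptr_t list_len_from_symbols(uintptr_t start, uintptr_t end)
{
	return load_symbol_addr_sketch(end) - load_symbol_addr_sketch(start);
}

/* Round 'addr' up to the next multiple of 'align' (a power of two). */
static uintptr_t align_up_sketch(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}
```

With an end symbol on an odd address the computed length comes out one byte short, which is why storing the size in a separate, properly aligned variable is the more robust choice.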
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: David Howells <dhowells@redhat.com>
$ git status
# On branch pending-rebases
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# kernel/x509_certificate_list
nothing added to commit but untracked files present (use "git add" to track)
$
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David Howells <dhowells@redhat.com>
Add a flag to tell the PCI subsystem that the kernel is shutting down in
preparation to kexec a kernel. Add code in the PCI subsystem to use this flag
to clear the Bus Master bit on PCI devices only in case of a kexec reboot.
This fixes a power-off problem on Acer Aspire V5-573G and likely other
machines and avoids any other issues caused by clearing Bus Master bit on
PCI devices in normal shutdown path. The problem was introduced by
b566a22c23 ("PCI: disable Bus Master on PCI device shutdown").
This patch is based on discussion at
http://marc.info/?l=linux-pci&m=138425645204355&w=2
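The flag check described above might look roughly like this. The flag name, the device structure and the shutdown hook are hypothetical stand-ins; only the Bus Master enable bit (0x4 in the PCI command register) is taken as given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bus Master enable bit of the PCI command register. */
#define PCI_COMMAND_MASTER_SKETCH 0x4

/* Hypothetical stand-in for a PCI device; only the command register. */
struct pci_dev_sketch {
	uint16_t command;
};

/* Hypothetical flag: set by the reboot path only when a kexec is under way. */
static bool kexec_in_progress_sketch;

/* Shutdown hook: clear Bus Master only for a kexec reboot. */
static void pci_device_shutdown_sketch(struct pci_dev_sketch *dev)
{
	assert(dev != NULL);
	if (kexec_in_progress_sketch)
		dev->command &= (uint16_t)~PCI_COMMAND_MASTER_SKETCH;
}
```

On a normal shutdown the command register is left untouched, which is the behaviour the power-off fix relies on.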
Link: https://bugzilla.kernel.org/show_bug.cgi?id=63861
Reported-by: Chang Liu <cl91tp@gmail.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: stable@vger.kernel.org # v3.5+
ae7f164a09 ("cgroup: move cgroup->subsys[] assignment to
online_css()") moved cgroup->subsys[] assignments later in
cgroup_create() but didn't update error handling path accordingly
leading to the following oops and leaking later css's after an
online_css() failure. The oops is from cgroup destruction path being
invoked on the partially constructed cgroup which is not ready to
handle empty slots in cgrp->subsys[] array.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
PGD a780a067 PUD aadbe067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in:
CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
Hardware name:
task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
RIP: 0010:[<ffffffff810eeaa8>] [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
RSP: 0018:ffff8800a781bd98 EFLAGS: 00010282
RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
FS: 00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
Stack:
ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
Call Trace:
[<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
[<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
[<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
[<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
[<ffffffff8169e569>] system_call_fastpath+0x16/0x1b
This patch moves reference bumping inside online_css() loop, clears
css_ar[] as css's are brought online successfully, and updates
err_destroy path so that either a css is fully online and destroyed by
cgroup_destroy_locked() or the error path frees it. This creates a
duplicate css free logic in the error path but it will be cleaned up
soon.
v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
invoked with a cgroup which doesn't have all css's populated.
Update cgroup_destroy_locked() so that it skips NULL css's.
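The v2 change amounts to tolerating empty slots during teardown. A minimal sketch, assuming hypothetical `css_sketch` and `cgroup_sketch` types rather than the real cgroup structures:

```c
#include <assert.h>
#include <stddef.h>

#define SUBSYS_COUNT_SKETCH 4

/* Hypothetical, simplified stand-ins for css and cgroup. */
struct css_sketch {
	int online;
};

struct cgroup_sketch {
	struct css_sketch *subsys[SUBSYS_COUNT_SKETCH];
};

/*
 * Take every populated css offline and return how many were handled.
 * Empty slots are skipped, so a partially constructed cgroup is safe.
 */
static int cgroup_destroy_sketch(struct cgroup_sketch *cgrp)
{
	int i, n = 0;

	assert(cgrp != NULL);
	for (i = 0; i < SUBSYS_COUNT_SKETCH; i++) {
		struct css_sketch *css = cgrp->subsys[i];

		if (!css)
			continue;	/* never brought online: nothing to undo */
		css->online = 0;
		cgrp->subsys[i] = NULL;
		n++;
	}
	return n;
}
```

A cgroup whose creation failed after two of four css's came online is then destroyed without dereferencing the two empty slots.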
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Reported-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: stable@vger.kernel.org # v3.12+
Merge tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"A regression showed up that there's a large delay when enabling all
events. This was prevalent when FTRACE_SELFTEST was enabled which
enables all events several times, and caused the system bootup to
pause for over a minute.
This was tracked down to an addition of a synchronize_sched()
performed when system call tracepoints are unregistered.
The synchronize_sched() is needed between the unregistering of the
system call tracepoint and a deletion of a tracing instance buffer.
But placing the synchronize_sched() in the unreg of *every* system
call tracepoint is a bit overboard. A single synchronize_sched()
before the deletion of the instance is sufficient"
* tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Only run synchronize_sched() at instance deletion time