The existing timekeeping_adjust logic has always been complicated
to understand. Further, since it was developed prior to NOHZ becoming
common, its not surprising it performs poorly when NOHZ is enabled.
Since Miroslav pointed out the problematic nature of the existing code
in the NOHZ case, I've tried to refactor the code to perform better.
The problem with the previous approach was that it tried to adjust
for the total cumulative error using a scaled dampening factor. This
resulted in large errors to be corrected slowly, while small errors
were corrected quickly. With NOHZ the timekeeping code doesn't know
how far out the next tick will be, so this results in bad
over-correction to small errors, and insufficient correction to large
errors.
Inspired by Miroslav's patch, I've refactored the code to try to
address the correction in two steps.
1) Check the future freq error for the next tick, and if the frequency
error is large, try to make sure we correct it so it doesn't cause
much accumulated error.
2) Then make a small single unit adjustment to correct any cumulative
error that has collected over time.
This method performs fairly well in the simulator Miroslav created.
Major credit to Miroslav for pointing out the issue, providing the
original patch to resolve this, a simulator for testing, as well as
helping debug and resolve issues in my implementation so that it
performed closer to his original implementation.
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In the GENERIC_TIME_VSYSCALL_OLD update_vsyscall implementation,
we take the tk_xtime() value, which returns a timespec64, and
store it in a timespec.
This luckily is ok, since the only architectures that use
GENERIC_TIME_VSYSCALL_OLD are ia64 and ppc64, which are both
64 bit systems where timespec64 is the same as a timespec.
Even so, for cleanliness reasons, use the conversion function
to assign the proper type.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Expose the new NMI safe accessor to clock monotonic to the tracer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Tracers want a correlated time between the kernel instrumentation and
user space. We really do not want to export sched_clock() to user
space, so we need to provide something sensible for this.
Using separate data structures with an non blocking sequence count
based update mechanism allows us to do that. The data structure
required for the readout has a sequence counter and two copies of the
timekeeping data.
On the update side:
smp_wmb();
tkf->seq++;
smp_wmb();
update(tkf->base[0], tk);
smp_wmb();
tkf->seq++;
smp_wmb();
update(tkf->base[1], tk);
On the reader side:
do {
seq = tkf->seq;
smp_rmb();
idx = seq & 0x01;
now = now(tkf->base[idx]);
smp_rmb();
} while (seq != tkf->seq)
So if a NMI hits the update of base[0] it will use base[1] which is
still consistent, but this timestamp is not guaranteed to be monotonic
across an update.
The timestamp is calculated by:
now = base_mono + clock_delta * slope
So if the update lowers the slope, readers who are forced to the
not yet updated second array are still using the old steeper slope.
tmono
^
| o n
| o n
| u
| o
|o
|12345678---> reader order
o = old slope
u = update
n = new slope
So reader 6 will observe time going backwards versus reader 5.
While other CPUs are likely to be able observe that, the only way
for a CPU local observation is when an NMI hits in the middle of
the update. Timestamps taken from that NMI context might be ahead
of the following timestamps. Callers need to be aware of that and
deal with it.
V2: Got rid of clock monotonic raw and reorganized the data
structures. Folded in the barrier fix from Mathieu.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
All the function needs is in the tk_read_base struct. No functional
change for the current code, just a preparatory patch for the NMI safe
accessor to clock monotonic which will use struct tk_read_base as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The members of the new struct are the required ones for the new NMI
safe accessor to clcok monotonic. In order to reuse the existing
timekeeping code and to make the update of the fast NMI safe
timekeepers a simple memcpy use the struct for the timekeeper as well
and convert all users.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Access to time requires to touch two cachelines at minimum
1) The timekeeper data structure
2) The clocksource data structure
The access to the clocksource data structure can be avoided as almost
all clocksource implementations ignore the argument to the read
callback, which is a pointer to the clocksource.
But the core needs to touch it to access the members @read and @mask.
So we are better off by copying the @read function pointer and the
@mask from the clocksource to the core data structure itself.
For the most used ktime_get() access all required data including the
@read and @mask copies fits together with the sequence counter into a
single 64 byte cacheline.
For the other time access functions we touch in the current code three
cache lines in the worst case. But with the clocksource data copies we
can reduce that to two adjacent cachelines, which is more efficient
than disjunct cache lines.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
cycle_last was added to the clocksource to support the TSC
validation. We moved that to the core code, so we can get rid of the
extra copy.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The only user of the cycle_last validation is the x86 TSC. In order to
provide NMI safe accessor functions for clock monotonic and
monotonic_raw we need to do that in the core.
We can't do the TSC specific
if (now < cycle_last)
now = cycle_last;
for the other wrapping around clocksources, but TSC has
CLOCKSOURCE_MASK(64) which actually does not mask out anything so if
now is less than cycle_last the subtraction will give a negative
result. So we can check for that in clocksource_delta() and return 0
for that case.
Implement and enable it for x86
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
We want to move the TSC sanity check into core code to make NMI safe
accessors to clock monotonic[_raw] possible. For this we need to
sanity check the delta calculation. Create a helper function and
convert all sites to use it.
[ Build fix from jstultz ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Provide a ktime_t based interface for raw monotonic time.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
timekeeping_clocktai() is not used in fast pathes, so the extra
timespec conversion is not problematic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Subtracting plain nsec values and converting to timespec is simpler
than the whole timespec math. Not really fastpath code, so the
division is not an issue.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
get_monotonic_boottime() is not used in fast pathes, so the extra
timespec conversion is not problematic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Converting cputime to timespec and timespec to nanoseconds makes no
sense. Use cputime_to_ns() and be done with it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Kill the timespec juggling and calculate with plain nanoseconds.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Simplify the timespec to nsec/usec conversions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Simplify the only user of this data by removing the timespec
conversion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Required for moving drivers to the nanosecond based interfaces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
ktime based conversion function to map a monotonic time stamp to a
different CLOCK.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Provide a helper function which lets us implement ktime_t based
interfaces for real, boot and tai clocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Speed up ktime_get() by using ktime_t based data. Text size shrinks by
64 bytes on x8664.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The ktime_t based interfaces are used a lot in performance critical
code pathes. Add ktime_t based data so the interfaces don't have to
convert from the xtime/timespec based data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
We already have a function which does the right thing, that also makes
sure that the coming ktime_t based cached values are getting updated.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
struct timekeeper is quite badly sorted for the hot readout path. Most
time access functions need to load two cache lines.
Rearrange it so ktime_get() and getnstimeofday() are happy with a
single cache line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
To convert callers of the core code to timespec64 we need to provide
the proper interfaces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Right now we have time related prototypes in 3 different header
files. Move it to a single timekeeping header file and move the core
internal stuff into a core private header.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Convert the core timekeeping logic to use timespec64s. This moves the
2038 issues out of the core logic and into all of the accessor
functions.
Future changes will need to push the timespec64s out to all
timekeeping users, but that can be done interface by interface.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Helper and conversion functions for timespec64.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
With the plain nanoseconds based ktime_t we can simply use
ktime_divns() instead of going through loops and hoops of
timespec/timeval conversion.
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The non-scalar ktime_t implementation is basically a timespec
which has to be changed to support dates past 2038 on 32bit
systems.
This patch removes the non-scalar ktime_t implementation, forcing
the scalar s64 nanosecond version on all architectures.
This may have additional performance overhead on some 32bit
systems when converting between ktime_t and timespec structures,
however the majority of 32bit systems (arm and i386) were already
using scalar ktime_t, so no performance regressions will be seen
on those platforms.
On affected platforms, I'm open to finding optimizations, including
avoiding converting to timespecs where possible.
[ tglx: We can now cleanup the ktime_t.tv64 mess, but thats a
different issue and we can throw a coccinelle script at it ]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Rather then having two similar but totally different implementations
that provide timekeeping state to the hrtimer code, try to unify the
two implementations to be more simliar.
Thus this clarifies ktime_get_update_offsets to
ktime_get_update_offsets_now and changes get_xtime... to
ktime_get_update_offsets_tick.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Provide a default stub function instead of having the extra
conditional. Cuts binary size on a m68k build by ~100 bytes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Create a module that allows udelay() to be executed to ensure that
it is delaying at least as long as requested (with a little bit of
error allowed).
There are some configurations which don't have reliably udelay
due to using a loop delay with cpufreq changes which should use
a counter time based delay instead. This test aims to identify
those configurations where timing is unreliable.
Signed-off-by: David Riley <davidriley@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTwvRvAAoJEHm+PkMAQRiG8CoIAJucWkj+MJFFoDXjR9hfI8U7
/WeQLJP0GpWGMXd2KznX9epCuw5rsuaPAxCy1HFDNOa7OtNYacWrsIhByxOIDLwL
YjDB9+fpMMPFWsr+LPJa8Ombh/TveCS77w6Pt5VMZFwvIKujiNK/C3MdxjReH5Gr
iTGm8x7nEs2D6L2+5sQVlhXot/97phxIlBSP6wPXEiaztNZ9/JZi905Xpgq+WU16
ZOA8MlJj1TQD4xcWyUcsQ5REwIOdQ6xxPF00wv/12RFela+Puy4JLAilnV6Mc12U
fwYOZKbUNBS8rjfDDdyX3sljV1L5iFFqKkW3WFdnv/z8ZaZSo5NupWuavDnifKw=
=6Q8o
-----END PGP SIGNATURE-----
Merge tag 'v3.16-rc5' into timers/core
Reason: Bring in upstream modifications, so the pending changes which
depend on them can be queued.
Pull cgroup fixes from Tejun Heo:
"Mostly fixes for the fallouts from the recent cgroup core changes.
The decoupled nature of cgroup dynamic hierarchy management
(hierarchies are created dynamically on mount but may or may not be
reused once unmounted depending on remaining usages) led to more
ugliness being added to kernfs.
Hopefully, this is the last of it"
* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: break kernfs active protection in cpuset_write_resmask()
cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
kernfs: introduce kernfs_pin_sb()
cgroup: fix mount failure in a corner case
cpuset,mempolicy: fix sleeping function called from invalid context
cgroup: fix broken css_has_online_children()
Pull workqueue fixes from Tejun Heo:
"Two workqueue fixes. Both are one liners. One fixes missing uevent
for workqueue files on sysfs. The other one fixes missing zeroing of
NUMA cpu masks which can lead to oopses among other things"
* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: zero cpumask of wq_numa_possible_cpumask on init
workqueue: fix dev_set_uevent_suppress() imbalance
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.
BUG: unable to handle kernel paging request at 0000000000001d08
IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
PGD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
[<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
[<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
[<ffffffff811926f1>] new_slab+0x91/0x300
[<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
[<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
[<ffffffff810a3c78>] ? load_balance+0x218/0x890
[<ffffffff8101a679>] ? sched_clock+0x9/0x10
[<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
[<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
[<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
[<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
[<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
[<ffffffff8105d0ec>] do_fork+0xbc/0x360
[<ffffffff8105d3b6>] kernel_thread+0x26/0x30
[<ffffffff81086652>] kthreadd+0x2c2/0x300
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
[<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.
When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:
3592 /* if cpumask is contained inside a NUMA node, we belong to that node */
3593 if (wq_numa_enabled) {
3594 for_each_node(node) {
3595 if (cpumask_subset(pool->attrs->cpumask,
3596 wq_numa_possible_cpumask[node])) {
3597 pool->node = node;
3598 break;
3599 }
3600 }
3601 }
But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.
By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809a ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
Pull irq fixes from Thomas Gleixner:
"A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak
which I introduced in the last round of cleanups :("
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Fix memory leak when calling irq_free_hwirqs()
irqchip: spear_shirq: Fix interrupt offset
irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive
irqchip: armada-370-xp: Mask all interrupts during initialization.
irq_free_hwirqs() always calls irq_free_descs() with a cnt == 0
which makes it a no-op since the interrupt count to free is
decremented in itself.
Fixes: 7b6ef12625
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1404167084-8070-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
running:
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
kills the uprobe. Along the way he found some other minor bugs and clean ups
that he fixed up making it a total of 4 patches.
Doing unrelated work, I found that the reading of the ftrace trace
file disables all function tracer callbacks. This was fine when ftrace
was the only user, but now that it's used by perf and kprobes, this
is a bug where reading trace can disable kprobes and perf. A very unexpected
side effect and should be fixed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTtMugAAoJEKQekfcNnQGuRs8H/2HhNAy0F1pAFYT5tH2o0z/Z
z6NFn83FUUoesg/Bd+1Dk7VekIZIdo1JxQc67/Y0D0oylPPr31gmVvk2llFPdJV5
xuWiUOuMbDq4Eh3Een8yaOsNsGbcX0lgw9qJEyqAvhJMi5G4dyt3r/g+vFThAyqm
O8Uv74GBGmUmmGMyZuW2r2f2vSEANSXTLbzFqj54fV7FNms1B0MpZ/2AiRcEwCzi
9yMainwrO1VPVSrSJFkW8g4sNl5X1M4tiIT8wGN75YePJVK3FAH8LmUDfpau4+ae
/QGdjNkRDWpZ3mMJPbz3sAT7USnMlSZ5w80Rf/CZuZAe2Ncycvh9AhqcFV0+fj8=
=3h4s
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code
where running:
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
kills the uprobe. Along the way he found some other minor bugs and
clean ups that he fixed up making it a total of 4 patches.
Doing unrelated work, I found that the reading of the ftrace trace
file disables all function tracer callbacks. This was fine when
ftrace was the only user, but now that it's used by perf and kprobes,
this is a bug where reading trace can disable kprobes and perf. A
very unexpected side effect and should be fixed"
* tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove ftrace_stop/start() from reading the trace file
tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
tracing/uprobes: Revert "Support mix of ftrace and perf"