This fixes:
incompatible pointer type: => 89
arch/um/kernel/exec.c: warning: passing argument 2 of 'execve1' from
incompatible pointer type: => 69, 85
arch/um/kernel/exec.c: warning: passing argument 3 of 'execve1' from
incompatible pointer type: => 69, 85
which was introduced by d7627467b7 ("Make do_execve() take a const
filename pointer")
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the regression caused by the commit 6fee48cd33
("dma-mapping: arm: use generic pci_set_dma_mask and
pci_set_consistent_dma_mask").
ARM needs to clip the dma coherent mask for dmabounce devices. This
restores the old trick.
Note that strictly speaking, the DMA API doesn't allow architectures to do
such but I'm not sure it's worth adding the new API to set the dma mask
that allows architectures to clip it.
Reported-by: Krzysztof Halasa <khc@pm.waw.pl>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
sparc: Prevent no-handler signal syscall restart recursion.
sparc: Don't mask signal when we can't setup signal frame.
sparc64: Fix race in signal instruction flushing.
sparc64: Support RAW perf events.
Make sigreturn zero regs->trap, make do_signal() do the same on all
paths. As it is, signal interrupting e.g. read() from fd 512 (==
ERESTARTSYS) with another signal getting unblocked when the first
handler finishes will lead to restart one insn earlier than it ought
to. Same for multiple signals with in-kernel handlers interrupting
that sucker at the same time. Same for multiple signals of any kind
interrupting that sucker on 64bit...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Explicitly clear the "in-syscall" bit when we have no signal
handler and back up the program counters to back up the system
call.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't invoke the signal handler tracehook in that situation
either.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
lguest: update comments to reflect LHCALL_LOAD_GDT_ENTRY.
virtio: console: Prevent userspace from submitting NULL buffers
virtio: console: Fix poll blocking even though there is data to read
If another cpu does a very wide munmap() on the signal frame area,
it can tear down the page table hierarchy from underneath us.
Borrow an idea from the 64-bit fault path's get_user_insn(), and
disable cross call interrupts during the page table traversal
to lock them in place while we operate.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
We used to have a hypercall which reloaded the entire GDT, then we
switched to one which loaded a single entry (to match the IDT code).
Some comments were not updated, so fix them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reported by: Eviatar Khen <eviatarkhen@gmail.com>
We need to make sure that only the first do_signal() to be handled on
the way out syscall will bother with syscall restarts; additionally, the
check on the "signal has user handler" path had been wrong - compare
with restart prevention in sigreturn()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_signal() should place the syscall number in gr7, not gr8 when
handling ERESTART_WOULDBLOCK.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use force_sigsegv() rather than force_sig(SIGSEGV, ...) as the former
resets the SEGV handler pointer which will kill the process, rather than
leaving it open to an infinite loop if the SEGV handler itself caused a
SEGV signal.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a) sa_handler might be maliciously set to point to kernel memory;
blindly dereferencing it in FDPIC case is a Bad Idea(tm).
b) I'm not sure you need that set_fs(USER_DS) there at all, but if you
do, you'd better do it *before* checking the frame you've decided to
use with access_ok(), lest sigaltstack() becomes a convenient
roothole.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reset restart_block.fn on executing a sigreturn such that any currently
pending system call restarts will be forced to return -EINTR.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha-2.6:
alpha: deal with multiple simultaneously pending signals
alpha: fix a 14 years old bug in sigreturn tracing
alpha: unb0rk sigsuspend() and rt_sigsuspend()
alpha: belated ERESTART_RESTARTBLOCK race fix
alpha: Shift perf event pending work earlier in timer interrupt
alpha: wire up fanotify and prlimit64 syscalls
alpha: kill big kernel lock
alpha: fix build breakage in asm/cacheflush.h
alpha: remove unnecessary cast from void* in assignment.
alpha: Use static const char * const where possible
Unlike the other targets, alpha sets _one_ sigframe and
buggers off until the next syscall/interrupt, even if
more signals are pending. It leads to quite a few unpleasant
inconsistencies, starting with SIGSEGV potentially arriving
not where it should and including e.g. mess with sigsuspend();
consider two pending signals blocked until sigsuspend()
unblocks them. We pick the first one; then, if we are hit
by interrupt while in the handler, we process the second one
as well. If we are not, and if no syscalls had been made,
we get out of the first handler and leave the second signal
pending; normally sigreturn() would've picked it anyway, but
here it starts with restoring the original mask and voila -
the second signal is blocked again. On everything else we
get both delivered consistently.
It's actually easy to fix; the only thing to watch out for
is prevention of double syscall restart. Fortunately, the
idea I've nicked from arm fix by rmk works just fine...
Testcase demonstrating the behaviour in question; on alpha
we get one or both flags set (usually one), on everything
else both are always set.
#include <signal.h>
#include <stdio.h>
int had1, had2;
void f1(int sig) { had1 = 1; }
void f2(int sig) { had2 = 1; }
main()
{
sigset_t set1, set2;
sigemptyset(&set1);
sigemptyset(&set2);
sigaddset(&set2, 1);
sigaddset(&set2, 2);
signal(1, f1);
signal(2, f2);
sigprocmask(SIG_SETMASK, &set2, NULL);
raise(1);
raise(2);
sigsuspend(&set1);
printf("had1:%d had2:%d\n", had1, had2);
}
Tested-by: Michael Cree <mcree@orcon.net.nz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matt Turner <mattst88@gmail.com>
The way sigreturn() is implemented on alpha breaks PTRACE_SYSCALL,
all way back to 1.3.95 when alpha has grown PTRACE_SYSCALL support.
What happens is direct return to ret_from_syscall, in order to bypass
mangling of a3 (error indicator) and prevent other mutilations of
registers (e.g. by syscall restart). That's fine, but... the entire
TIF_SYSCALL_TRACE codepath is kept separate on alpha and post-syscall
stopping/notifying the tracer is after the syscall. And the normal
path we are forcibly switching to doesn't have it.
So we end up with *one* stop in traced sigreturn() vs. two in other
syscalls. And yes, strace is visibly broken by that; try to strace
the following
#include <signal.h>
#include <stdio.h>
void f(int sig) {}
main()
{
signal(SIGHUP, f);
raise(SIGHUP);
write(1, "eeeek\n", 6);
}
and watch the show. The
close(1) = 405
in the end of strace output is coming from return value of write() (6 ==
__NR_close on alpha) and syscall number of exit_group() (__NR_exit_group ==
405 there).
The fix is fairly simple - the only thing we end up missing is the call
of syscall_trace() and we can tell whether we'd been called from the
SYSCALL_TRACE path by checking ra value. Since we are setting the
switch_stack up (that's what sys_sigreturn() does), we have the right
environment for calling syscall_trace() - just before we call
undo_switch_stack() and return. Since undo_switch_stack() will overwrite
s0 anyway, we can use it to store the result of "has it been called from
SYSCALL_TRACE path?" check. The same thing applies in rt_sigreturn().
Tested-by: Michael Cree <mcree@orcon.net.nz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matt Turner <mattst88@gmail.com>
Old code used to set regs->r0 and regs->r19 to force the right
return value. Leaving that after switch to ERESTARTNOHAND
was a Bad Idea(tm), since now that screws the restart - if we
hit the case when get_signal_to_deliver() returns 0, we will
step back to syscall insn, with v0 set to EINTR and a3 to 1.
The latter won't matter, since EINTR is 4, aka __NR_write.
Testcase:
#include <signal.h>
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
main()
{
sigset_t mask;
sigemptyset(&mask);
sigaddset(&mask, SIGCONT);
sigprocmask(SIG_SETMASK, &mask, NULL);
kill(0, SIGCONT);
syscall(__NR_sigsuspend, 1, "b0rken\n", 7);
}
results on alpha in immediate message to stdout...
Fix is obvious; moreover, since we don't need regs anymore, we can
switch to normal prototypes for these guys and lose the wrappers.
Even better, rt_sigsuspend() is identical to generic version in
kernel/signal.c now.
Tested-by: Michael Cree <mcree@orcon.net.nz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matt Turner <mattst88@gmail.com>
same thing as had been done on other targets back in 2003 -
move setting ->restart_block.fn into {rt_,}sigreturn().
Tested-by: Michael Cree <mcree@orcon.net.nz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matt Turner <mattst88@gmail.com>
Pending work from the performance event subsystem is executed in
the timer interrupt. This patch shifts the call to
perf_event_do_pending() before the call to update_process_times()
as the latter may call back into the perf event subsystem and it
is prudent to have the pending work executed first.
Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Signed-off-by: Matt Turner <mattst88@gmail.com>
The 2.6.36-rc kernel added three new system calls:
fanotify_init, fanotify_mark, and prlimit64. This
patch wires them up on Alpha.
Built and booted on an XP900. Untested beyond that.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Matt Turner <mattst88@gmail.com>
All uses of the BKL on alpha are totally bogus, nothing
is really protected by this. Remove the remaining users
so we don't have to mark alpha as 'depends on BKL'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: linux-alpha@vger.kernel.org
Signed-off-by: Matt Turner <mattst88@gmail.com>
Alpha SMP flush_icache_user_range() is implemented as an inline
function inside include/asm/cacheflush.h. It dereferences @current
but doesn't include linux/sched.h and thus causes build failure if
linux/sched.h wasn't included previously. Fix it by including the
needed header file explicitly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Matt Turner <mattst88@gmail.com>
Add IORESOURCE_IRQ_HIGHLEVEL irq flag to dm9000 driver
platform data in board mach-real6410.
Signed-off-by: Darius Augulis <augulis.darius@gmail.com>
[kgene.kim@samsung.com: minor title fix]
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
Fix errors reported by checkpatch.pl script
Signed-off-by: Darius Augulis <augulis.darius@gmail.com>
[kgene.kim@samsung.com: minor title fix]
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
Avoids build warnings due to the undeclared non-statics.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
If a signal hits us outside of a syscall and another gets delivered
when we are in sigreturn (e.g. because it had been in sa_mask for
the first one and got sent to us while we'd been in the first handler),
we have a chance of returning from the second handler to location one
insn prior to where we ought to return. If r0 happens to contain -513
(-ERESTARTNOINTR), sigreturn will get confused into doing restart
syscall song and dance.
Incredible joy to debug, since it manifests as random, infrequent and
very hard to reproduce double execution of instructions in userland
code...
The fix is simple - mark it "don't bother with restarts" in wrapper,
i.e. set r8 to 0 in sys_sigreturn and sys_rt_sigreturn wrappers,
suppressing the syscall restart handling on return from these guys.
They can't legitimately return a restart-worthy error anyway.
Testcase:
#include <unistd.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/time.h>
#include <errno.h>
void f(int n)
{
__asm__ __volatile__(
"ldr r0, [%0]\n"
"b 1f\n"
"b 2f\n"
"1:b .\n"
"2:\n" : : "r"(&n));
}
void handler1(int sig) { }
void handler2(int sig) { raise(1); }
void handler3(int sig) { exit(0); }
main()
{
struct sigaction s = {.sa_handler = handler2};
struct itimerval t1 = { .it_value = {1} };
struct itimerval t2 = { .it_value = {2} };
signal(1, handler1);
sigemptyset(&s.sa_mask);
sigaddset(&s.sa_mask, 1);
sigaction(SIGALRM, &s, NULL);
signal(SIGVTALRM, handler3);
setitimer(ITIMER_REAL, &t1, NULL);
setitimer(ITIMER_VIRTUAL, &t2, NULL);
f(-513); /* -ERESTARTNOINTR */
write(1, "buggered\n", 9);
return 1;
}
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: hpet: Work around hardware stupidity
x86, build: Disable -fPIE when compiling with CONFIG_CC_STACKPROTECTOR=y
x86, cpufeature: Suppress compiler warning with gcc 3.x
x86, UV: Fix initialization of max_pnode
Lengths and types of breakpoints are encoded in a half byte
into CPU registers. However when we extract these values
and store them, we add a high half byte part to them: 0x40 to the
length and 0x80 to the type.
When that gets reloaded to the CPU registers, the high part
is masked.
While making the instruction breakpoints available for perf,
I zapped that high part on instruction breakpoint encoding
and that broke the arch -> generic translation used by ptrace
instruction breakpoints. Writing dr7 to set an inst breakpoint
was then failing.
There is no apparent reason for these high parts so we could get
rid of them altogether. That's an invasive change though so let's
do that later and for now fix the problem by restoring that inst
breakpoint high part encoding in this sole patch.
Reported-by: Kelvie Wong <kelvie@ieee.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
* 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
arch/tile: fix formatting bug in register dumps
arch/tile: fix memcpy_fromio()/memcpy_toio() signatures
arch/tile: Save and restore extra user state for tilegx
arch/tile: Change struct sigcontext to be more useful
arch/tile: finish const-ifying sys_execve()
This patch adds CPU type detection for the Intel Celeron 540, which is
part of the Core 2 family according to Wikipedia; the family and ID pair
is absent from the Volume 3B table referenced in the source code
comments. I have tested this patch on an Intel Celeron 540 machine
reporting itself as Family 6 Model 22, and OProfile runs on the machine
without issue.
Spec:
http://download.intel.com/design/mobile/SPECUPDT/317667.pdf
Signed-off-by: Patrick Simmons <linuxrocks123@netscape.net>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: stable@kernel.org
Signed-off-by: Robert Richter <robert.richter@amd.com>
Tony's fix (f574c84319) has a small bug,
it incorrectly uses "r3" as a scratch register in the first of the two
unlock paths ... it is also inefficient. Optimize the fast path again.
Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This cut-and-paste bug was caused by rewriting the register dump
code to use only a single printk per line of output.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
During context switch, save and restore a couple of additional bits of
tilegx user state that can be persistently modified by userspace.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Rather than just using pt_regs, it now contains the actual saved
state explicitly, similar to pt_regs. By doing it this way, we
provide a cleaner API for userspace (or equivalently, we avoid the
need for libc to provide its own definition of sigcontext).
While we're at it, move PT_FLAGS_xxx to where they are not visible
from userspace. And always pass siginfo and mcontext to signal
handlers, even if they claim they don't need it, since sometimes
they actually try to use it anyway in practice.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
The sys_execve() implementation was properly const-ified but not
the declaration, the syscall wrappers, or the compat version.
This change completes the constification process.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
* ssh://master.kernel.org/home/hpa/tree/sec:
x86-64, compat: Retruncate rax after ia32 syscall entry tracing
x86-64, compat: Test %rax for the syscall number, not %eax
compat: Make compat_alloc_user_space() incorporate the access_ok()
Fix up the IRQ names for the MN10300 on-chip serial ports in the driver as
request_interrupt() no longer allows names containing slashes, giving a warning
like the following if one is encountered:
------------[ cut here ]------------
WARNING: at fs/proc/generic.c:323 __xlate_proc_name+0x62/0x7c()
name 'ttySM0/Rx'
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit d4d6715, we reopened an old hole for a 64-bit ptracer touching a
32-bit tracee in system call entry. A %rax value set via ptrace at the
entry tracing stop gets used whole as a 32-bit syscall number, while we
only check the low 32 bits for validity.
Fix it by truncating %rax back to 32 bits after syscall_trace_enter,
in addition to testing the full 64 bits as has already been added.
Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
On 64 bits, we always, by necessity, jump through the system call
table via %rax. For 32-bit system calls, in theory the system call
number is stored in %eax, and the code was testing %eax for a valid
system call number. At one point we loaded the stored value back from
the stack to enforce zero-extension, but that was removed in checkin
d4d6715016. An actual 32-bit process
will not be able to introduce a non-zero-extended number, but it can
happen via ptrace.
Instead of re-introducing the zero-extension, test what we are
actually going to use, i.e. %rax. This only adds a handful of REX
prefixes to the code.
Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@kernel.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
compat_alloc_user_space() expects the caller to independently call
access_ok() to verify the returned area. A missing call could
introduce problems on some architectures.
This patch incorporates the access_ok() check into
compat_alloc_user_space() and also adds a sanity check on the length.
The existing compat_alloc_user_space() implementations are renamed
arch_compat_alloc_user_space() and are used as part of the
implementation of the new global function.
This patch assumes NULL will cause __get_user()/__put_user() to either
fail or access userspace on all architectures. This should be
followed by checking the return value of compat_access_user_space()
for NULL in the callers, at which time the access_ok() in the callers
can also be removed.
Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Bottomley <jejb@parisc-linux.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: <stable@kernel.org>
This more or less reverts commits 08be979 (x86: Force HPET
readback_cmp for all ATI chipsets) and 30a564be (x86, hpet: Restrict
read back to affected ATI chipsets) to the status of commit 8da854c
(x86, hpet: Erratum workaround for read after write of HPET
comparator).
The delta to commit 8da854c is mostly comments and the change from
WARN_ONCE to printk_once as we know the call path of this function
already.
This needs really in depth explanation:
First of all the HPET design is a complete failure. Having a counter
compare register which generates an interrupt on matching values
forces the software to do at least one superfluous readback of the
counter register.
While it is nice in theory to program "absolute" time events it is
practically useless because the timer runs at some absurd frequency
which can never be matched to real world units. So we are forced to
calculate a relative delta and this forces a readout of the actual
counter value, adding the delta and programming the compare
register. When the delta is small enough we run into the danger that
we program a compare value which is already in the past. Due to the
compare for equal nature of HPET we need to read back the counter
value after writing the compare rehgister (btw. this is necessary for
absolute timeouts as well) to make sure that we did not miss the timer
event. We try to work around that by setting the minimum delta to a
value which is larger than the theoretical time which elapses between
the counter readout and the compare register write, but that's only
true in theory. A NMI or SMI which hits between the readout and the
write can easily push us beyond that limit. This would result in
waiting for the next HPET timer interrupt until the 32bit wraparound
of the counter happens which takes about 306 seconds.
So we designed the next event function to look like:
match = read_cnt() + delta;
write_compare_ref(match);
return read_cnt() < match ? 0 : -ETIME;
At some point we got into trouble with certain ATI chipsets. Even the
above "safe" procedure failed. The reason was that the write to the
compare register was delayed probably for performance reasons. The
theory was that they wanted to avoid the synchronization of the write
with the HPET clock, which is understandable. So the write does not
hit the compare register directly instead it goes to some intermediate
register which is copied to the real compare register in sync with the
HPET clock. That opens another window for hitting the dreaded "wait
for a wraparound" problem.
To work around that "optimization" we added a read back of the compare
register which either enforced the update of the just written value or
just delayed the readout of the counter enough to avoid the issue. We
unfortunately never got any affirmative info from ATI/AMD about this.
One thing is sure, that we nuked the performance "optimization" that
way completely and I'm pretty sure that the result is worse than
before some HW folks came up with those.
Just for paranoia reasons I added a check whether the read back
compare register value was the same as the value we wrote right
before. That paranoia check triggered a couple of years after it was
added on an Intel ICH9 chipset. Venki added a workaround (commit
8da854c) which was reading the compare register twice when the first
check failed. We considered this to be a penalty in general and
restricted the readback (thus the wasted CPU cycles) to the known to
be affected ATI chipsets.
This turned out to be a utterly wrong decision. 2.6.35 testers
experienced massive problems and finally one of them bisected it down
to commit 30a564be which spured some further investigation.
Finally we got confirmation that the write to the compare register can
be delayed by up to two HPET clock cycles which explains the problems
nicely. All we can do about this is to go back to Venki's initial
workaround in a slightly modified version.
Just for the record I need to say, that all of this could have been
avoided if hardware designers and of course the HPET committee would
have thought about the consequences for a split second. It's out of my
comprehension why designing a working timer is so hard. There are two
ways to achieve it:
1) Use a counter wrap around aware compare_reg <= counter_reg
implementation instead of the easy compare_reg == counter_reg
Downsides:
- It needs more silicon.
- It needs a readout of the counter to apply a relative
timeout. This is necessary as the counter does not run in
any useful (and adjustable) frequency and there is no
guarantee that the counter which is used for timer events is
the same which is used for reading the actual time (and
therefor for calculating the delta)
Upsides:
- None
2) Use a simple down counter for relative timer events
Downsides:
- Absolute timeouts are not possible, which is not a problem
at all in the context of an OS and the expected
max. latencies/jitter (also see Downsides of #1)
Upsides:
- It needs less or equal silicon.
- It works ALWAYS
- It is way faster than a compare register based solution (One
write versus one write plus at least one and up to four
reads)
I would not be so grumpy about all of this, if I would not have been
ignored for many years when pointing out these flaws to various
hardware folks. I really hate timers (at least those which seem to be
designed by janitors).
Though finally we got a reasonable explanation plus a solution and I
want to thank all the folks involved in chasing it down and providing
valuable input to this.
Bisected-by: Nix <nix@esperi.org.uk>
Reported-by: Artur Skawina <art.08.09@gmail.com>
Reported-by: Damien Wyart <damien.wyart@free.fr>
Reported-by: John Drescher <drescherjm@gmail.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The irqs.h usage here got missed in the Samsung platform reorganisation.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: Jassi Brar <jassi.brar@samsung.com>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>