Table of contents
=================

Last updated: 20 December 2005

Contents
========

- Introduction
- Devices not appearing
- Finding patch that caused a bug
-- Finding using git-bisect
-- Finding it the old way
- Fixing the bug

Introduction
============

Always try the latest kernel from kernel.org and build from source. If you are
not confident in doing that, please report the bug to your distribution vendor
instead of to a kernel developer.

Finding bugs is not always easy. Have a go though. If you can't find it, don't
give up. Report as much as you have found to the relevant maintainer. See
MAINTAINERS for who that is for the subsystem you have worked on.

Before you submit a bug report, read REPORTING-BUGS.

Devices not appearing
=====================

Often this is caused by udev. Check that first before blaming it on the
kernel.

Finding patch that caused a bug
===============================

Finding using git-bisect
------------------------

The bisection tools that come with git make finding bugs easy, provided the
bug is reproducible.

Steps to do it:
- start using git for the kernel source
- read the man page for git-bisect
- have fun

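To get a feel for what git-bisect does before trying it on a kernel tree, here
is a minimal sketch on a throwaway repository. Everything in it (the repository,
the file name "state", the word BROKEN, the ten commits) is made up for the
demonstration; on a real kernel tree the test step would be "build, boot, and
try to reproduce the bug" instead of a grep.

```shell
#!/bin/sh
# Toy demonstration of git bisect in a throwaway repository (not a kernel tree).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Bisect Demo"
git config user.email "demo@example.invalid"

# Ten commits; the pretend bug (the word BROKEN) first appears in commit 7.
i=1
while [ "$i" -le 10 ]; do
	if [ "$i" -ge 7 ]; then
		echo "BROKEN $i" > state
	else
		echo "ok $i" > state
	fi
	git add state
	git commit -q -m "commit $i"
	if [ "$i" -eq 1 ]; then
		good=$(git rev-parse HEAD)	# remember the known-good commit
	fi
	i=$((i + 1))
done

# Mark HEAD as bad and the first commit as good, then let git drive the search.
# The test command must exit 0 for a good tree and non-zero for a bad one.
git bisect start HEAD "$good" >/dev/null
git bisect run sh -c '! grep -q BROKEN state' >/dev/null

# After the run, refs/bisect/bad points at the first bad commit.
first_bad_subject=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $first_bad_subject"

git bisect reset >/dev/null 2>&1
cd /
rm -rf "$repo"
```

When the test cannot be scripted, the same search can be driven by hand by
answering "git bisect good" or "git bisect bad" after each build and test.
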
Finding it the old way
----------------------

[Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)]

This is how to track down a bug if you know nothing about kernel hacking.
It's a brute force approach but it works pretty well.

You need:

        . A reproducible bug - it has to happen predictably (sorry)
        . All the kernel tar files from a revision that worked to the
          revision that doesn't

You will then do:

        . Rebuild a revision that you believe works, install, and verify that.
        . Do a binary search over the kernels to figure out which one
          introduced the bug. I.e., suppose 1.3.28 didn't have the bug, but
          you know that 1.3.69 does. Pick a kernel in the middle and build
          that, like 1.3.50. Build & test; if it works, pick the mid point
          between .50 and .69, else the mid point between .28 and .50.
        . You'll narrow it down to the kernel that introduced the bug. You
          can probably do better than this but it gets tricky.

        . Narrow it down to a subdirectory

          - Copy kernel that works into "test". Let's say that 1.3.62 works,
            but 1.3.63 doesn't. So you diff -r those two kernels and come
            up with a list of directories that changed. For each of those
            directories:

                Copy the non-working directory next to the working directory
                as "dir.63".
                One directory at a time, move the working directory aside as
                "dir.62" and put dir.63 in its place:

                        mv dir dir.62
                        mv dir.63 dir
                        find dir -name '*.[oa]' -print | xargs rm -f

                And then rebuild and retest. Assuming that all related
                changes were contained in the sub directory, this should
                isolate the change to a directory.

                Problems: changes in header files may have occurred; I've
                found in my case that they were self explanatory - you may
                or may not want to give up when that happens.

        . Narrow it down to a file

          - You can apply the same technique to each file in the directory,
            hoping that the changes in that file are self contained.

        . Narrow it down to a routine

          - You can take the old file and the new file and manually create
            a merged file that has

                #ifdef VER62
                routine()
                {
                        ...
                }
                #else
                routine()
                {
                        ...
                }
                #endif

            And then walk through that file, one routine at a time and
            prefix it with

                #define VER62
                /* both routines here */
                #undef VER62

            Then recompile, retest, move the ifdefs until you find the one
            that makes the difference.

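The binary search over kernel revisions described in the steps above can be
sketched as a small loop. This is only an illustration: is_good is a
hypothetical stand-in for "build, install, boot and test that kernel", and it
is faked here by pretending the bug first appeared in 1.3.57 so that the sketch
can run on its own.

```shell
#!/bin/sh
# Binary search over 1.3.x patch levels between a known-good and a known-bad kernel.

# Hypothetical check: returns 0 if kernel 1.3.$1 works. Faked for the demo;
# in practice this is where you build, install, boot and test by hand.
is_good() {
	[ "$1" -lt 57 ]
}

lo=28	# 1.3.28 is known to work
hi=69	# 1.3.69 is known to show the bug

# Invariant: lo is good, hi is bad. Stop when they are adjacent.
while [ $((hi - lo)) -gt 1 ]; do
	mid=$(( (lo + hi) / 2 ))
	if is_good "$mid"; then
		lo=$mid
	else
		hi=$mid
	fi
done

first_bad="1.3.$hi"
echo "last good: 1.3.$lo, first bad: $first_bad"
```

Each pass halves the range, so the 41 kernels between 1.3.28 and 1.3.69 need
about six builds rather than forty.
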
Finally, you take all the info that you have, kernel revisions, bug
description, the extent to which you have narrowed it down, and pass
that off to whoever you believe is the maintainer of that section.
A post to linux.dev.kernel isn't such a bad idea if you've done some
work to narrow it down.

If you get it down to a routine, you'll probably get a fix in 24 hours.

My apologies to Linus and the other kernel hackers for describing this
brute force approach; it's hardly what a kernel hacker would do. However,
it does work and it lets non-hackers help fix bugs. And it is cool
because Linux snapshots will let you do this - something that you can't
do with vendor supplied releases.

Fixing the bug
==============

Nobody is going to tell you how to fix bugs. Seriously. You need to work it
out. But below are some hints on how to use the tools.

To debug a kernel, use objdump and look for the hex offset from the crash
output to find the valid line of code/assembler. Without debug symbols, you
will see the assembler code for the routine shown, but if your kernel has
debug symbols the C code will also be available. (Debug symbols can be enabled
in the kernel hacking menu of the menu configuration.) For example:

        objdump -r -S -l --disassemble net/dccp/ipv4.o

NB: you need to be at the top level of the kernel tree for this to pick up
your C files.

If you don't have access to the code, you can also debug from some crash
dumps, e.g. the crash dump output as shown by Dave Miller.

> EIP is at ip_queue_xmit+0x14/0x4c0
>  ...
> Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
>         00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
>         <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85
>
> Put the bytes into a "foo.s" file like this:
>
>        .text
>        .globl foo
> foo:
>        .byte  .... /* bytes from Code: part of OOPS dump */
>
> Compile it with "gcc -c -o foo.o foo.s" then look at the output of
> "objdump --disassemble foo.o".
>
> Output:
>
> ip_queue_xmit:
>     push       %ebp
>     push       %edi
>     push       %esi
>     push       %ebx
>     sub        $0xbc, %esp
>     mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb)
>     mov        0x8(%ebp), %ebx         ! %ebx = skb->sk
>     mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt

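Typing the bytes into foo.s by hand is tedious, so the step can be scripted.
The following is one possible sketch, not part of the quoted mail: it takes
the Code: bytes from the dump above, drops the <> that mark the faulting byte,
and writes them out as a .byte directive. The variable names are arbitrary.

```shell
#!/bin/sh
# Turn the "Code:" bytes of an oops into an assembler file called foo.s.
code='44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00 00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08 <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85'

# Remove the <> markers, prefix every byte with 0x, separate bytes with commas.
bytes=$(printf '%s\n' "$code" | tr -d '<>' |
	sed -e 's/[0-9a-f][0-9a-f]/0x&/g' -e 's/ /, /g')

{
	printf '\t.text\n'
	printf '\t.globl foo\n'
	printf 'foo:\n'
	printf '\t.byte %s\n' "$bytes"
} > foo.s

# Then, as in the quoted text:
#   gcc -c -o foo.o foo.s
#   objdump --disassemble foo.o
```

Since the <> markers are removed, note the position of the marked byte first
(here it is the 8b that follows 5d 08) so you can find the faulting
instruction in the disassembly.
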
In addition, you can use GDB to figure out the exact file and line
number of the OOPS from the vmlinux file. If you have
CONFIG_DEBUG_INFO enabled, you can simply copy the EIP value from the
OOPS:

 EIP:    0060:[<c021e50e>]    Not tainted VLI

And use GDB to translate that to human-readable form:

  gdb vmlinux
  (gdb) l *0xc021e50e

If you don't have CONFIG_DEBUG_INFO enabled, note the function
offset from the OOPS:

 EIP is at vt_ioctl+0xda8/0x1482

Then recompile the kernel with CONFIG_DEBUG_INFO enabled and look up
that offset:

  make vmlinux
  gdb vmlinux
  (gdb) p vt_ioctl
  (gdb) l *(0x<address of vt_ioctl> + 0xda8)

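The address to pass to "l" in the last step is just the function's start
address plus the offset from the OOPS. As a small worked example, assuming
"p vt_ioctl" reported a start address of 0xc0200000 (a made-up value; yours
will differ), the sum can be checked in the shell:

```shell
#!/bin/sh
# Add the oops offset to the function's start address.
sym_addr=0xc0200000	# hypothetical address printed by "p vt_ioctl"
offset=0xda8		# from "EIP is at vt_ioctl+0xda8/0x1482"

target=$(printf '0x%x' $((sym_addr + offset)))
echo "(gdb) l *$target"
```

The second number in the OOPS (0x1482) is the size of the function, so an
offset larger than that means the symbol lookup picked the wrong function.
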
Another very useful option of the Kernel Hacking section in menuconfig is
Debug memory allocations. This will help you see whether data has been
initialised, or is being used before it has been set. To see the values that
get assigned with this, look at mm/slab.c and search for POISON_INUSE. When
using this, an Oops will often show the poisoned data instead of zero, which
is the default.

Once you have worked out a fix, please submit it upstream. After all, open
source is about sharing what you do, and don't you want to be recognised for
your genius?

Please do read Documentation/SubmittingPatches though to help your code get
accepted.