> We have alot of production machines here, all running RH5.2 and 2.2.9
> kernel. For some reason, our primary and secondary nameservers keep
> crashing all of the sudden. The kernel has been on them since the day it
> came out, but out of the blue they keep crashing multiple times per day
> with the following Oops:
>
> kmem_free: Bad obj addr (objp = c4110b80, name = size-128)
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> *pde = 00000000
> Oops: 0002
> CPU: 0
> EIP: 0010:[<c011e3c9>]
> EFLAGS: 00010202
> eax: 00000039 ebx: c0373140 ecx: c0202000 edx: ffffff9d
> esi: c3549e00 edi: 00000202 ebp: 00000001 esp: c0203e1c
> ds: 0018 es: 0018 ss: 0018
> Process swapper (pid: 0, process nr: 0, stackpage=c0203000)
>
>
> >From my System.map, the EIP line points to:
>
> c011e244 T kfree
>
<snip>
> Other than that, their is nothing special about these boxes. All the
> RH5.2 updates have been installed. What can I do to further narrow down
> the source of the problem?
Hi Brian,
(not a guru, so can only suggest ways I would approach the problem)
If you have a spare machine lying around that you can set up to be a
temporary nameserver, any one of the options below might get you closer.
The downside is that you have to be able to tolerate nameserver death
briefly, and have time to do lots and LOTS of digging.
Now, if you're still reading :) you could....
Install IKD, enable ktracer with a modest trace buffer. Set up a serial
console on another machine and use IKD's ability to dump the trace buffer
to console for later use (SysRQ-g). Since SysRQ is dead after panic time,
you could modify kernel/panic.c:panic() to force the trace by inserting
something like this (copied from drivers/char/sysrq.c)..
#ifdef CONFIG_TRACE
printk(KERN_INFO "DAL: ktrace start\n");
ktrace_to_console();
printk(KERN_INFO "DAL: ktrace end\n");
#endif
..after if (panic_timeout > 0). Boot the machine with a panic=N kernel
parameter and wait for kaboom to happen. The tail of the trace should
show what led up to the problem. This effectively turns a simple oops
into a megaoops with tons of kernel history (after ktrace processing).
Resources: Documentation/serial-console.txt
Documentation/ktrace.txt
ftp://e-mind.com/pub/andrea/ikd2.2.9_ikd1.bz2
Another way would be to modify memleak such that you can use it's wrap
method to finger the original caller trying to free this bad address.
If you're interested in this route, and don't know how, contact me
privately, and I'll make you an incremental diff to do this [1]. (it's
easy.. wrap kfree, and pass pointer to alloc_struct down the stack.
__kmem_cache_free() is the guy generating the oops in this case, so
printk IDPTR->file and IDPTR->line prior to oops. You can also examine
memleak's allocation accounting array and if the alloc_struct pointer
is still present [already freed addr might be the problem], printk where
it was originally allocated from.)
'nuther option..
Install SGI's kdb. It pops up automagically upon oops, so you can poke
around in a hot (and busted) system before it dies. Very spiffy IMHO.
-Mike
Caveat to patch promise: reaction time might be suboptimal.. building
house and have absolutely merciless construction boss (aka wife:-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/