Re: kernel ooops... no timeout on find

Alexander Viro (viro@math.psu.edu)
Tue, 11 May 1999 04:10:50 -0400 (EDT)


On Tue, 11 May 1999, Craig Armour wrote:

> Hi,
>
> here is a kernel oops I seem to get very regularly. the system is an
> Asus P2bDS? dual PII with 512meg ram kernel is 2.2.7 . There are no NFS
> file systems on this box.
>
> I think this could be linked to a problem I'm having with phantom loads
> and sleeping find process running in state D
>
>
> 2:54pm up 6 days, 1:03, 6 users, load average: 1.03, 1.01, 1.00
> 86 processes: 85 sleeping, 1 running, 0 zombie, 0 stopped
> CPU states: 1.1% user, 2.7% system, 0.0% nice, 96.2% idle
>
> 3554 ? D 0:01 find / ( -fstype nfs -o -fstype NFS -o -type d -regex
> \(^/tm
>
> the problem is occuring in both smp mode and non-smp mode. I'm guessing
> (becuase I don't know much of how kernels work...) that there is no time
> out for NFS file systems... in the interim, I'm going to kill updatedb
> from the cron as I believe this is the parent process. I can't kill the
> find process.
>
> any patches/ideas would be very helpfull... at the moment, I'm
> struggling to get more than two weeks of uptime out of what should be an
> enterprise server.

Darn... I see what happened, but how... You've got the following nice
combination:
a) extra pointer to dentry. OK, that would normally give you
KERN_CRIT-level printk followed by forced oops.
b) non-NULL d_parent not equal to dentry in question.
c) NULL (or otherwise invalid) pointer in parent's d_name.
That way you got an oops trying to pass arguments to printk.

Sheesh... It may be a random memory corruption from whatever reason, but
since you have it reproducably I suspect that it's a real dcache corruption.

Could you try the following:
--- fs/dcache.c Thu Apr 29 07:48:57 1999
+++ fs/dcache.c.new Tue May 11 04:21:13 1999
@@ -153,10 +153,19 @@
return;
}

- printk(KERN_CRIT "Negative d_count (%d) for %s/%s\n",
- count,
- dentry->d_parent->d_name.name,
- dentry->d_name.name);
+ printk(KERN_CRIT "Shit happens in dput():\n");
+ printk(KERN_CRIT " d_count=%d, d_inode = %p, d_name=%p\n",
+ dentry->d_count,dentry->d_inode,dentry->d_name);
+ if (dentry->d_inode) {
+ printk(KERN_CRIT " i_ino=%d, i_dev = %d\n",
+ dentry->d_inode->i_ino,
+ kdev_t_to_nr(dentry->d_inode->i_dev));
+ }
+ if (dentry->d_name)
+ printk(KERN_CRIT " d_name.name=%s\n", dentry->d_name.name);
+ if (dentry->d_parent && dentry->d_parent->d_name)
+ printk(KERN_CRIT " parent: d_name.name=%s\n",
+ dentry->d_parent->d_name.name);
*(int *)0 = 0;
}

... and look what it will give. inode number and device may at least hint
where the hell it happens.
It's not the first report indicating that something is wrong with
dcache and this one at least may narrow the things down. Please, post the
resulting log as soon as you'll reproduce the oops with the patched
variant in place, OK?
Cheers,
Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/