RE: Possible bug in wait4(), 2.1.126-129 ?

Ion Badulescu (ionut@moisil.cs.columbia.edu)
Mon, 23 Nov 1998 09:56:40 -0500 (EST)


Hi Tris,

Thanks a lot for all your efforts; I really appreciate them.

On Mon, 23 Nov 1998, Greaves Tristan TM wrote:

> Linux discord.bra01.icl.co.uk 2.0.35 #6 Thu Oct 8 19:27:08 BST 1998 i486
> unknown

Ok...

> > What versions of cron, ld-linux.so.2 and glibc do you have?
>
> cron: vixie-cron-3.0.1-24
> ld-linux.so.2: Not sure. What's the best way of determining this?

Just ls -l /lib/ld-linux.so.2* should do. But it doesn't seem to be
relevant.

> glibc: glibc-2.0.7-13
>
> > Just for the peace of my mind, can you try running the C
> > program I posted earlier and see if you get the same result?

[...]

> So there you go. Effectively the same behaviour.

Thanks for trying. I get the same thing here; in fact, the C program was
modeled on the strace output of the perl script and then reduced to the
minimum that still exhibits the problem.

> Have you tried it with non- RH5.1 systems then? If so, it looks like
> its a distribution problem and not a kernel one. Probably related to
> the version of cron in use. Try installing a different crond and see
> what happens.....

I tried it on rh42, which is libc5-based, and it doesn't exhibit this
problem. I tried running the rh42 binary on the rh51 system, and it _does_
have the problem. So yes, at first glance it would appear that crond
is at fault. I'm sure I could install a different crond, or a different
_something_ that would mask the problem. That's not the point though...

The real point is that what we see should never happen, regardless of
libc, cron or pretty much anything userspace can do. Only the current
direct parent of a zombie can "wait" for it. Yet in my test case the
zombie effectively disappears without the parent waiting for it and before
the parent itself has finished. I confirmed this by inserting a longer
sleep() in the parent and then checking the process table: the child is
gone for good! Needless to say, this completely breaks UNIX process
semantics...
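
A minimal sketch of that check, for anyone who wants to repeat it. This is
not the program posted earlier in the thread; the function name
zombie_survives() and its return convention are made up for illustration.
It forks a child that exits at once, sleeps in the parent, and then asks
whether the parent can still collect the child:

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the exited child could still be
 * reaped by its parent after delay_seconds, 0 if it had already vanished
 * (waitpid() failed with ECHILD), and -1 on any other error.
 */
int zombie_survives(unsigned int delay_seconds)
{
    int status;
    pid_t pid, got;

    /* Assumption: start from the default SIGCHLD disposition so that an
     * inherited setting cannot influence the result. */
    signal(SIGCHLD, SIG_DFL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);               /* child: exit immediately */

    /* The child should sit in the process table as a zombie for this
     * whole interval, because nobody has waited for it yet. */
    sleep(delay_seconds);

    got = waitpid(pid, &status, 0);
    if (got == pid)
        return 1;               /* expected: the parent reaps its own child */
    if (got < 0 && errno == ECHILD)
        return 0;               /* the child disappeared behind our back */
    return -1;
}
```

Calling zombie_survives(10) from main() and running ps in another terminal
during the sleep shows whether the child is still listed as a zombie. A
return value of 0 corresponds to the behaviour described above.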

The subject of this thread is therefore wrong. It is not wait4() that is
broken; something more devious must be happening. Either the kernel just
"loses" the child, or it somehow gets confused and moves the child onto
the children list of process 1 (init), even though it is not an orphan
yet. Either way, this is a bug, and the fact that even 2.0.xx has it makes
me even more nervous about it.

Next I am going to instrument my init to log the pid of every child it
waits for. That will hopefully narrow the bug down a little and give me a
lead on the exact race condition. I'll keep you posted.
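
Roughly, the instrumentation could look like the sketch below. It is not
taken from any real init source; reap_and_log() and its parameters are
hypothetical names. The idea is only to write down every pid that
waitpid(-1, ...) hands back, so a "lost" child would show up in the log if
init were the one collecting it:

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: reap children and log each pid to the given
 * stream. Pass WNOHANG in options to collect only children that have
 * already exited (e.g. from a SIGCHLD-driven loop), or 0 to block until
 * no children remain. Returns the number of children reaped.
 */
int reap_and_log(FILE *log, int options)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, options);

        if (pid > 0) {
            fprintf(log, "reaped pid %ld, raw status 0x%x\n",
                    (long)pid, (unsigned int)status);
            reaped++;
            continue;
        }
        if (pid < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        break;                  /* 0: nothing ready; -1: no children left */
    }
    fflush(log);
    return reaped;
}
```

If the pid of the vanished child turns up in that log, the child was
reparented to init while its real parent was still alive.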

Thanks a lot for all your cooperation,

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
            than to open it and remove all doubt.
