Re: Race conditions galore (2.0.33 and possibly 2.1.x)

Linus Torvalds (torvalds@transmeta.com)
Tue, 23 Dec 1997 13:06:03 -0800 (PST)


On Tue, 23 Dec 1997, Stephen R. van den Berg wrote:
>
> Well, they were certainly not marked as running or suspended, but I never
> did say that they were marked as being in disk wait according to ps.
> Actually, ps showed them in a state designated with a dot, like in:
>
> PID TT STAT TIME
> 185 ? . 56:02 /usr/sbin/innd -p4 -r -i0 -c4 -L
>
> I'm not sure where the dot comes from, or what it should designate.
> (I'm using proc-ps as in the bo distribution of Debian).
>
> I forgot to check the current->state from within kdebug, but that's
> because current was not in the context (so gdb told me).

"current" is actually just a macro that expands to the proper thing. You
can use "current_set[0]" instead on UP (on SMP it's a lot harder, although
the proper gdb macros should make it reasonably straightforward on 2.1:
something like "(struct task_struct *)(esp & 0xffffe000)" works there, but
not on 2.0.x).
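
[A minimal sketch of the masking trick above, not kernel code. It assumes the 2.1 task structure sits at the base of an 8 KB aligned kernel stack region, which is what the 0xffffe000 mask implies on 32-bit x86. The names KSTACK_SIZE and task_base_from_sp are hypothetical, chosen only for illustration.]

```c
#include <stdint.h>

/* Assumed stack region size: 8 KB, so on 32-bit ~(8192 - 1) == 0xffffe000. */
#define KSTACK_SIZE 8192u

/*
 * Hypothetical helper: round a stack pointer down to the start of its
 * 8 KB aligned region. In gdb the equivalent would be casting the
 * result of (esp & 0xffffe000) to (struct task_struct *).
 */
static uintptr_t task_base_from_sp(uintptr_t sp)
{
    return sp & ~((uintptr_t)KSTACK_SIZE - 1);
}
```

Any stack pointer inside the same 8 KB region maps to the same base address, which is why no global variable such as current_set[] is needed to find the running task.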

It would be interesting to see what current_set[0]->state says when the
hang happens.

> > - something clears the locked state without waking people up. Do you
> > use "md" or anything else that plays around with buffers?
>
> Which still makes me kind of wonder why my rearrangement fixes things.
> The only behaviour changed here apparently is that *if*
> during the execution of run_task_queue(&tq_disk) current->state is altered,
> then we don't overwrite it before jumping into schedule().

Right. I moved the current->state setting to just before run_task_queue(),
and maybe you should try that simpler one-liner. It still shouldn't make
any difference, but I still suspect that "md" is doing something to
trigger the problem, and that would probably happen during the task queue
running (the disk tq feeds all the requests to the low-level devices).
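
[A toy user-space model of the two orderings discussed above, not the actual 2.0.33 source. All names (toy_task, toy_run_queue, and so on) are hypothetical. It only illustrates the one behavioural difference described: whether a state change made while the task queue runs survives until schedule() would be called.]

```c
#include <stdbool.h>

enum toy_state { TOY_RUNNING, TOY_UNINTERRUPTIBLE };

struct toy_task {
    enum toy_state state;
};

/*
 * Stand-in for running the disk task queue. If handler_wakes_task is
 * true, some handler alters the task state while the queue runs.
 */
static void toy_run_queue(struct toy_task *t, bool handler_wakes_task)
{
    if (handler_wakes_task)
        t->state = TOY_RUNNING;
}

/*
 * Ordering A: run the queue first, then set the state.
 * Any state change made during the queue run is overwritten.
 * Returns the state the scheduler would see.
 */
static enum toy_state state_set_after_queue(bool handler_wakes_task)
{
    struct toy_task t = { TOY_RUNNING };

    toy_run_queue(&t, handler_wakes_task);
    t.state = TOY_UNINTERRUPTIBLE;
    return t.state;
}

/*
 * Ordering B: set the state first, then run the queue.
 * A state change made during the queue run is preserved.
 */
static enum toy_state state_set_before_queue(bool handler_wakes_task)
{
    struct toy_task t = { TOY_RUNNING };

    t.state = TOY_UNINTERRUPTIBLE;
    toy_run_queue(&t, handler_wakes_task);
    return t.state;
}
```

When nothing touches the state during the queue run, both orderings hand the scheduler the same state, which is the sense in which the change "shouldn't make any difference". They diverge only when a handler alters the state in between.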

> > - really strange K5 bug
>
> Which would be even more difficult to explain in the light of my
> patch.

Oh, I agree. I wasn't really serious, because the K5 bug would have to be
obvious under other circumstances too (i.e. an out-of-order bug that does
the wrong thing when interrupts happen).

Linus