Re: Race conditions galore (2.0.33 and possibly 2.1.x)

MOLNAR Ingo (mingo@chiara.csoma.elte.hu)
Tue, 23 Dec 1997 13:01:15 +0100 (CET)


On 22 Dec 1997, Linus Torvalds wrote:

> >Maybe the bug is that something marked the buffers as not locked without
> >waking anything up? Then your change in ordering might make a
> >difference, if the buffer has been touched multiple times.
>
> Ahh, the md driver does indeed do something like this. The md driver
> will clear the lock bit without ever waking up anybody that waits on it,

genhd.c does some scary stuff indeed, but otherwise the mainline raid0
code seems to be safe. Unless people are playing around with fdisk on a
running news server, the genhd.c ugliness should have no effect. (i'm now
assuming 2.0.33)
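
For reference, the failure mode Linus describes above (the lock bit gets
cleared, but nobody sleeping on the buffer is woken) can be sketched in a
few lines of user-space C. This is only a toy model under stated
assumptions: struct buf_sketch and every *_sketch function are
hypothetical names invented for illustration, not the kernel's buffer
head or wait-queue code, and "sleeping" is reduced to a counter.

```c
#include <assert.h>

/* Hypothetical toy model of a locked buffer; not the kernel's struct. */
struct buf_sketch {
    int locked;    /* stands in for the buffer's lock bit */
    int sleepers;  /* tasks asleep until somebody wakes them */
};

/* A task that finds the buffer locked goes to sleep on it.
 * Returns 1 if the caller went to sleep, 0 if the buffer was free. */
static int wait_on_buf_sketch(struct buf_sketch *b)
{
    if (!b->locked)
        return 0;
    b->sleepers++;
    return 1;
}

/* Expected unlock path: clear the bit and wake every sleeper. */
static void unlock_and_wake_sketch(struct buf_sketch *b)
{
    b->locked = 0;
    b->sleepers = 0;
}

/* Suspected buggy path: clear the bit, wake nobody. */
static void unlock_without_wake_sketch(struct buf_sketch *b)
{
    b->locked = 0;
}

/* A 'stuck process' in this model: the buffer is unlocked, yet a task
 * is still asleep because no wake-up was ever delivered. */
static int has_stuck_sleeper_sketch(const struct buf_sketch *b)
{
    return !b->locked && b->sleepers > 0;
}
```

In this model a sleeper is only released by the wake-up, never by the bit
changing on its own, which is why a silent unlock would leave a process
hanging without anything crashing.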

_but_, we do have one unresolved RAID issue, which could be exactly this
problem: we have RAID1-5 'lockup' reports. One happened with a BusLogic
card; it's a process hanging the same way as described before. We usually
test on EIDE, and never saw such lockups. Looking at things from this
angle, the bug depends directly on IO parallelism, and _maybe_ on the SCSI
code. On the other hand, Leonard told me he gives RAID0 some really heavy
testing on his BusLogic disk farm, and he never saw such problems. But two
things seem to be sure: there are heavily used MD systems that do not have
this anomaly, and this bug never causes a crash.

another buffer.c thing i always found very volatile is the 'reuse list'.
We 'free' buffer heads in IRQ contexts, but we still use them 'for some
time', and we 'reuse' these buffer heads from the syscall level then. Is
it enforced anywhere that we can never accidentally recirculate the reuse
list from IRQ contexts? Does the following recover_reusable_buffer_heads
change pop up something on systems that show the 'stuck process' bug?

static inline void recover_reusable_buffer_heads(void)
{
	if (intr_count)
		printk("badness!\n");
	...

admittedly a very low chance, but maybe this pops up something?
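
To make the concern concrete, here is a rough user-space sketch of the
pattern: heads are parked on a reuse list from 'IRQ context' and only
recirculated from 'syscall level'. Everything in it is an assumption made
for illustration (struct bh_sketch, intr_count_sketch, the list names and
the *_sketch functions are hypothetical, not the real buffer.c code).
Note also that, unlike the printk check above, the sketch refuses to
touch the list when called with a nonzero interrupt count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer head; not the kernel's struct. */
struct bh_sketch {
    struct bh_sketch *next;
    int in_use;
};

static int intr_count_sketch;          /* nonzero while "in IRQ context" */
static struct bh_sketch *reuse_list;   /* heads released from IRQ context */
static struct bh_sketch *unused_list;  /* heads that may be handed out again */
static int badness_seen;               /* times recovery was tried from IRQ */

/* IRQ side: the head is 'freed', but only parked, not yet recycled. */
static void put_on_reuse_list(struct bh_sketch *bh)
{
    bh->next = reuse_list;
    reuse_list = bh;
}

/* Syscall side: move every parked head to the unused list.
 * Returns the number of heads moved, or -1 if called while
 * intr_count_sketch is nonzero (the case the check is meant to catch). */
static int recover_reusable_sketch(void)
{
    int moved = 0;

    if (intr_count_sketch) {
        badness_seen++;
        return -1;
    }
    while (reuse_list) {
        struct bh_sketch *bh = reuse_list;

        reuse_list = bh->next;
        bh->in_use = 0;
        bh->next = unused_list;
        unused_list = bh;
        moved++;
    }
    return moved;
}
```

If badness_seen ever became nonzero in a model like this, it would mean a
head could be recycled while its previous user in the interrupted code
path still held a pointer to it.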

> I will consider this an md bug until you tell me that you aren't
> actually using raid at all, at which point I'll go back to scratching my
> head.

i think this is just a coincidence: raid0 means much higher parallelism
and much higher load. (Apart from the genhd thing, that is.)

-- mingo