Putrid Elevator Behavior 2.4.18/19

From: Jeff V. Merkey (jmerkey@vger.timpanogas.org)
Date: Wed Mar 20 2002 - 14:04:55 EST


Jens/Linux,

The elevator code is malfunctioning in 2.4.18/19-pre when we start
reaching the upward limits with multiple 3Ware adapters
running together. We started seeing the problem when we went to
64 K aligned writes with sustained > 200 MB/S writes to
multiple 3Ware adapters.

We have verified that the 3Ware adapters are not holding the request
off, but that one of the requests is getting severely starved and
does not get posted to the 3Ware adapters until thousands of IOs
have gone before it.

The basic symptom is a lower offset 4K write gets hung in the elevator
as it traverses a very long list of requests being written linearly
to a disk device. Both Darren and I have seen this problem in NetWare
with remirroring, which is why we went to the A/B alternating
list to prevent this type of starvation. There are a very small number
of reads being posted during this test to update meta-data.

The data that is being held off is meta-data that occupies a lower sector
offset on the device. This startvation error is very troublesome and
results in certain sectors not being freed up as anticipated, which
results in a fatal error for our system. The elevator ends of getting
very far behind.

By way of example, this delayed write is held off for several **MINUTES**.
This is severaly broken.

Please let me know what other information you would like Darren and I
to run down and provide. We are at present coding some changes into
your elevator to implement an A and B list so this startvation problem
is completely avoided.

Please advise,

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Mar 23 2002 - 22:00:22 EST