Re: File-descriptors - large quantities

Michael O'Reilly (michael@metal.iinet.net.au)
Fri, 10 Jul 1998 09:32:25 +0800


Yes, I saw the paper. However, the issue here is the average
connection length vs the maximum number of disk I/Os per second
possible.

For a single process (i.e. you can only have one request in the disk
queue at a time), with 10ms disks (say), and a HUGE ram cache (~1.5% of
disk space), you'd be able to cache all the metadata, so you're down to
a single seek to read/write the data (say). This gives you a maximum
connection rate of

    max_rate = disk_IOs_per_second * 1.5

since only around 66% of all requests will require disk access, so each
disk I/O supports roughly 1.5 requests. This then gives a maximum FD
requirement of

    FDs = max_rate * connection_length

So for a 5 second connection length (that's worse than measured in
Australia on terrible links with 500ms RTTs), and a disk subsystem that
can handle 100 tps, you're looking at 100 * 1.5 = 150 new connections
per second, and 150 * 5 = 750 FDs in use at most, without saturating
your disk. The distribution of the lengths doesn't really matter: the
maximum number of new connections per second is determined by disk
speed, and the maximum number of FDs in use is determined by connection
rate and connection length. The shape of the distributions doesn't
matter.
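
Just to make the arithmetic explicit, something like this (the numbers
are the example figures above, nothing measured on a real box):

/* Back-of-envelope calculator for the figures above. */
#include <stdio.h>

int main(void)
{
	double disk_ios_per_sec = 100.0; /* disk subsystem: 100 tps           */
	double hit_factor       = 1.5;   /* ~1/0.66: only ~66% of requests
	                                  * need a disk I/O                   */
	double conn_length      = 5.0;   /* average connection length (sec)   */

	double max_rate = disk_ios_per_sec * hit_factor; /* new conns/sec */
	double fds      = max_rate * conn_length;        /* FDs in flight */

	printf("max_rate   = %.0f connections/sec\n", max_rate); /* 150 */
	printf("FDs in use = %.0f\n", fds);                      /* 750 */
	return 0;
}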

If the real rate of new connections goes higher than 'max_rate', then
the number of FDs in use starts blowing out in a _major_ way because
you can't clear them as fast as you're getting them.

Adding more FDs obviously won't fix the problem. It'll just mean that
instead of taking 5 minutes to run out of FDs, it takes 10 minutes, or
15. The real solution is to increase your maximum disk transactions per
second: either by adding more ram (decreasing the ram cache miss rate),
using faster disks (reducing seek times), or running more processes
(more requests outstanding at once gives you a gain from multiple disks
and from request ordering).
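
To see how fast the FDs pile up once you're past max_rate, here's a toy
simulation; the 170 connections/sec arrival rate is just an arbitrary
figure picked to sit above the 150/sec ceiling, and the one-second-step
model is deliberately crude:

/* Toy illustration of FDs in use once arrivals exceed max_rate.
 * Hypothetical numbers, deliberately crude fixed-rate model. */
#include <stdio.h>

int main(void)
{
	double max_rate     = 150.0; /* completions/sec the disk can sustain */
	double arrival_rate = 170.0; /* new connections/sec actually arriving */
	double conn_length  = 5.0;   /* seconds a connection lasts if the
	                              * disk isn't the bottleneck            */
	double fds = 0.0;            /* connections (FDs) currently open     */

	for (int t = 1; t <= 600; t++) {
		fds += arrival_rate;             /* new connections opened   */
		double done = fds / conn_length; /* would-be completions/sec */
		if (done > max_rate)             /* ...but the disk caps it  */
			done = max_rate;
		fds -= done;
		if (t % 120 == 0)
			printf("after %3d sec: %.0f FDs in use\n", t, fds);
	}
	return 0;
}

With arrivals below max_rate the FD count settles at arrival_rate *
conn_length; above it, it just keeps climbing.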

In message <199807091428.PAA01010@dax.dcs.ed.ac.uk>, "Stephen C. Tweedie" writes:
> There was an interesting paper on this at the recent Usenix.
[...]
> This is a real problem: you don't have to have a massively busy
> server, just a server requesting items from a slow domain, for the
> number of outstanding connections at a time to grow enormously.

This would be reflected in a much larger average connection length.
And yes, if the average connection length went to 20 seconds, then
you're in the 3000 FD range (20 * 150), but the measured lengths at the
site are sub 5 seconds. (This fits with my own measurements, and
corresponds fairly well with the measurements in the paper given the
differences in RTTs we see here in Oz.)

The real problem is that a single process means a single request in the
disk queue. So no matter how big your disk array is, you're bound by

    1 / (average_seek_time * ram_cache_miss_rate * disk_IOs_per_open)

open(2)s per second. With the numbers above that's 1 / (0.010 * 0.66 *
1), i.e. around 150 per second, which matches the 150 connections per
second figure.

If you move to a multi-process system (threads, clone(2), whatever),
you can have more than one request outstanding at once, which means
your capacity scales as

    number_of_disks * gain_from_IO_sorting * (the bound above)

which suddenly moves you from 60-80 tps to 400 tps and up.
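
Rough illustration of that scaling (the 4-disk array and the 1.3x
sorting gain are made-up figures, not measurements):

/* Rough capacity estimate for the multi-process case, built on the
 * single-process bound above.  Inputs are illustrative guesses. */
#include <stdio.h>

int main(void)
{
	double seek_time    = 0.010; /* 10ms average seek                  */
	double miss_rate    = 0.66;  /* fraction of opens that touch disk  */
	double ios_per_open = 1.0;   /* one seek per cache-missing open    */

	/* single process: one request in the disk queue at a time */
	double single = 1.0 / (seek_time * miss_rate * ios_per_open);

	/* multi-process: keep all spindles busy, plus sorted requests */
	double n_disks      = 4.0;   /* hypothetical 4-disk array          */
	double sorting_gain = 1.3;   /* guess at gain from request ordering */
	double multi = single * n_disks * sorting_gain;

	printf("single-process bound: ~%.0f opens/sec\n", single); /* ~150 */
	printf("multi-process bound:  ~%.0f opens/sec\n", multi);  /* ~790 */
	return 0;
}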

Michael.
