Re: clustering page-ins

Jamie Lokier (lkd@tantalophile.demon.co.uk)
Thu, 29 Jul 1999 22:28:59 +0200


Chuck Lever wrote:
> all the read-ahead and clustering parameters are adjustable, but how do you
> propose to measure the sweep rate? how can we tell when a cluster becomes
> very large, and thus is affecting other I/O bound jobs?

Sweep rate / read-ahead distance

By definition, the readahead distance is large enough when the pages
are already in cache just as they are faulted. So you can match the
sweep rate by starting small and increasing the readahead distance
whenever a fault hits a page that's not yet in cache.

That strategy will even work for a readahead distance smaller than a
cluster -- that way a good trigger point within the cluster
(e.g. your halfway point) is found. However, this is not a good
idea, because it requires several clusters to be read before the
right trigger point is found. So it's better (I think) to use
smaller clusters at first. An obvious thing is to limit the cluster
size to the readahead distance.

Here's a sketch of the strategy I'm imagining. Ignore bugs :-)

--> sequential fault is defined as in the range
        [end_of_last_read_ahead - read_ahead_distance, trigger_point]
    initially no address is in range

--> any non-sequential fault, which includes the first:

    end_of_last_read_ahead = ~0 /* force no fault to be sequential */
    if faulting page is in cache,
        map it & return

    read_ahead_distance = small (1 page?)
    cluster = read_ahead_distance
    this_cluster = max (cluster * 2, nonseq_cluster_size)
    this_base = align (fault_address, nonseq_alignment)
    end_of_last_read_ahead = this_base + this_cluster
    trigger_point = (end_of_last_read_ahead
                     - min (read_ahead_distance,
                            end_of_last_read_ahead - this_base))

    read-ahead (this_base, this_cluster)
    wait_on_page (fault_address)
    map it & return

--> otherwise fault is sequential

    if (fault_address != trigger_point)
        map it & return

    read-ahead (end_of_last_read_ahead, cluster)
    end_of_last_read_ahead += cluster

    if faulting page is not in cache,
        read_ahead_distance = min (read_ahead_distance * 2, max_read_ahead)
        cluster = min (read_ahead_distance, max_cluster /* 64k-ish */)

    trigger_point = (end_of_last_read_ahead
                     - min (read_ahead_distance,
                            end_of_last_read_ahead - trigger_point))

Cluster size

Cluster size (max_cluster above) is a mysterious quantity. Pick a
handy figure like 32-128k :-) Maybe it should use the same size as
the random-access aligned clusters? I.e. let max_cluster ==
nonseq_cluster_size == nonseq_alignment. Or maybe sequential access
will work better with smaller clusters than non-sequential access.
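
In terms of the constant names from the C transcription above, that
first option just ties the three together (the 64k figure is, again,
only a guess):

#define NONSEQ_ALIGNMENT     16                  /* 64k with 4k pages */
#define NONSEQ_CLUSTER_SIZE  NONSEQ_ALIGNMENT
#define MAX_CLUSTER          NONSEQ_CLUSTER_SIZE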

I don't expect any significant gain from large clusters if the
readahead distance can be larger than a cluster. With hard-fault
triggered readahead it was necessary to use larger clusters because
that reduced I/O waiting time, but soft-fault triggered readahead
isn't supposed to block, so there's no I/O waiting time to reduce.
The only thing saved is a little request processing time.

I think ll_rw_block will merge I/O requests if read-ahead is
generating lots of them and they're queueing up (can someone
confirm?)

-- Jamie
