Re: Block device throttling [Re: Distributed storage.]

From: Daniel Phillips
Date: Mon Aug 13 2007 - 01:37:02 EST


(previous incomplete message sent accidentally)

On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires to replace all
> occurence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> somthing_new), where queue limits will be appropriately charged. So
> far I'm testing second case, but I only changed DST for testing, can
> change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as
in your posted patch both suffer from the same problem: you release the
throttling on the previous queue when the bio moves to a new one, which
is a bug because memory consumption on the previous queue then becomes
unbounded, or limited only by the number of struct requests that can be
allocated. In other words, it reverts to the same situation we have
now as soon as the IO stack has more than one queue. (Just a shorter
version of my previous post.)

We can solve this by having the bio only point at the queue to which it
was originally submitted, since throttling the top level queue
automatically throttles all queues lower down the stack. Alternatively
the bio can point at the block_device or straight at the
backing_dev_info, which is the per-device structure it actually needs
to touch.

Note! There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure. A bio
can carry as few as one page or as many as 256 pages. If you take only
one throttle count per bio and that data will be transferred over the
network then you have to assume that (a little more than) 256 pages of
sk_alloc reserve will be needed for every bio, resulting in a grossly
over-provisioned reserve. The precise reserve calculation we want to
do is per-block device, and you will find hooks like this already
living in backing_dev_info. We need to place our own fn+data there to
calculate the throttle draw for each bio. Unthrottling gets trickier
with variable size throttle draw. In ddsnap, we simply write the
amount we drew from the throttle into (the private data of) bio for use
later by unthrottle, thus avoiding the issue that the bio fields we
used to calculate might have changed during the lifetime of the bio.
This would translate into one more per-bio field.

2) Exposing the per-block device throttle limits via sysfs or similar is
really not a good long term solution for system administration.
Imagine our help text: "just keep trying smaller numbers until your
system deadlocks". We really need to figure this out internally and
get it correct. I can see putting in a temporary userspace interface
just for experimentation, to help determine what really is safe, and
what size the numbers should be to approach optimal throughput in a
fully loaded memory state.

Regards,

Daniel


Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/