Re: Block device throttling [Re: Distributed storage.]

From: Evgeniy Polyakov
Date: Mon Aug 13 2007 - 09:08:38 EST


On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips (phillips@xxxxxxxxx) wrote:
> (previous incomplete message sent accidentally)
>
> On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> >
> > So, what did we decide? To bloat bio a bit (add a queue pointer) or
> > to use physical device limits? The latter requires to replace all
> > occurence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> > somthing_new), where queue limits will be appropriately charged. So
> > far I'm testing second case, but I only changed DST for testing, can
> > change all other users if needed though.
>
> Adding a queue pointer to struct bio and using physical device limits as
> in your posted patch both suffer from the same problem: you release the
> throttling on the previous queue when the bio moves to a new one, which
> is a bug because memory consumption on the previous queue then becomes
> unbounded, or limited only by the number of struct requests that can be
> allocated. In other words, it reverts to the same situation we have
> now as soon as the IO stack has more than one queue. (Just a shorter
> version of my previous post.)

No. Since all requests for virtual device end up in physical devices,
which have limits, this mechanism works. Virtual device will essentially
call either generic_make_request() for new physical device (and thus
will sleep is limit is over), or will process bios directly, but in that
case it will sleep in generic_make_request() for virutal device.

> 1) One throttle count per submitted bio is too crude a measure. A bio
> can carry as few as one page or as many as 256 pages. If you take only

It does not matter - we can count bytes, pages, bio vectors or whatever
we like, its just a matter of counter and can be changed without
problem.

> 2) Exposing the per-block device throttle limits via sysfs or similar is
> really not a good long term solution for system administration.
> Imagine our help text: "just keep trying smaller numbers until your
> system deadlocks". We really need to figure this out internally and
> get it correct. I can see putting in a temporary userspace interface
> just for experimentation, to help determine what really is safe, and
> what size the numbers should be to approach optimal throughput in a
> fully loaded memory state.

Well, we already have number of such 'supposed-to-be-automatic'
variables exported to userspace, so this will not change a picture,
frankly I do not care if there will or will not be any sysfs exported
tunable, eventually we can remove it or do not create at all.

--
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/