Re: Distributed storage.

From: Jens Axboe
Date: Mon Aug 13 2007 - 08:46:21 EST


On Sun, Aug 12 2007, Daniel Phillips wrote:
> On Tuesday 07 August 2007 13:55, Jens Axboe wrote:
> > I don't like structure bloat, but I do like nice design. Overloading
> > is a necessary evil sometimes, though. Even today, there isn't enough
> > room to hold bi_rw and bi_flags in the same variable on 32-bit archs,
> > so that concern can be scratched. If you read bio.h, that much is
> > obvious.
>
> Sixteen bits in bi_rw are consumed by queue priority. Is there a reason
> this lives in struct bio instead of struct request?

If you don't, you have to pass them down. You can make that very
statement about basically any member of struct bio, until we end up with
a submit_bio() path and down taking 16 arguments.

> > If you check up on the iommu virtual merging, you'll understand the
> > front and back size members. They may smell dubious to you, but
> > please take the time to understand why it looks the way it does.
>
> Virtual merging is only needed at the physical device, so why do these
> fields live in struct bio instead of struct request?

A bio does exist outside of a struct request, and bio buildup also
happens before it gets attached to such.

> > Changing the number of bvecs is integral to how bio buildup current
> > works.
>
> Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can
> derive efficiently from BIO_POOL_IDX() provided the bio was allocated
> in the standard way.

That would only be feasible, if we ruled that any bio in the system must
originate from the standard pools.

> This leaves a little bit of clean up to do for bios not allocated from
> a standard pool.

Please suggest how to do such a cleanup.

> Incidentally, why does the bvl need to be memset to zero on allocation?
> bi_vcnt already tells you which bvecs are valid and the only field in a
> bvec that can reasonably default to zero is the offset, which ought to
> be set set every time a bvec is initialized anyway.

We could probably skip that, but that's an unrelated subject.

> > > bi_destructor could be combined. I don't see a lot of users of
> > > bi_idx,
> >
> > bi_idx is integral to partial io completions.
>
> Struct request has a remaining submission sector count so what does
> bi_idx do that is different?

Struct request has remaining IO count. You still need to know where to
start in the bio.

> > > that looks like a soft target. See what happened to struct page
> > > when a couple of folks got serious about attacking it, some really
> > > deep hacks were done to pare off a few bytes here and there. But
> > > struct bio as a space waster is not nearly in the same ballpark.
> >
> > So show some concrete patches and examples, hand waving and
> > assumptions is just a waste of everyones time.
>
> Average struct bio memory footprint ranks near the bottom of the list of
> things that suck most about Linux storage. At idle I see 8K in use
> (reserves); during updatedb it spikes occasionally to 50K; under a
> heavy load generated by ddsnap on a storage box it sometimes goes to
> 100K with bio throttling in place. Really not moving the needle.

Then, again, stop wasting time on this subject. Just because struct bio
isn't a huge bloat is absolutely no justification for adding extra
members to it. It's not just about system wide bloat.

> On the other hand, vm writeout deadlock ranks smack dab at the top of
> the list, so that is where the patching effort must go for the
> forseeable future. Without bio throttling, the ddsnap load can go to
> 24 MB for struct bio alone. That definitely moves the needle. in
> short, we save 3,200 times more memory by putting decent throttling in
> place than by saving an int in struct bio.

Then fix the damn vm writeout. I always thought it was silly to depend
on the block layer for any sort of throttling. If it's not a system wide
problem, then throttle the io count in the make_request_fn handler of
that problematic driver.

> That said, I did a little analysis to get an idea of where the soft
> targets are in struct bio, and to get to know the bio layer a little
> better. Maybe these few hints will get somebody interested enough to
> look further.
>
> > > It would be interesting to see if bi_bdev could be made read only.
> > > Generally, each stage in the block device stack knows what the next
> > > stage is going to be, so why do we have to write that in the bio?
> > > For error reporting from interrupt context? Anyway, if Evgeniy
> > > wants to do the patch, I will happily unload the task of convincing
> > > you that random fields are/are not needed in struct bio :-)
> >
> > It's a trade off, otherwise you'd have to pass the block device
> > around a lot.
>
> Which costs very little, probably less than trashing an extra field's
> worth of cache.

Again, you can make that argument for most of the members. It's a
non-starter.

> > And it's, again, a design issue. A bio contains
> > destination information, that means device/offset/size information.
> > I'm all for shaving structure bytes where it matters, but not for the
> > sake of sacrificing code stability or design. I consider struct bio
> > quite lean and have worked hard to keep it that way. In fact, iirc,
> > the only addition to struct bio since 2001 is the iommu front/back
> > size members. And I resisted those for quite a while.
>
> You did not comment on the one about putting the bio destructor in
> the ->endio handler, which looks dead simple. The majority of cases
> just use the default endio handler and the default destructor. Of the
> remaining cases, where a specialized destructor is needed, typically a
> specialized endio handler is too, so combining is free. There are few
> if any cases where a new specialized endio handler would need to be
> written.

We could do that without too much work, I agree.

> As far as code stability goes, current kernels are horribly unstable in
> a variety of contexts because of memory deadlock and slowdowns related
> to the attempt to fix the problem via dirty memory limits. Accurate
> throttling of bio traffic is one of the two key requirements to fix
> this instability, the other other is accurate writeout path reserve
> management, which is only partially addressed by BIO_POOL.

Which, as written above and stated many times over the years on lkml, is
not a block layer issue imho.

> Nice to see you jumping in Jens. Now it is over to the other side of
> the thread where Evgeniy has posted a patch that a) grants your wish to
> add no new field in struct bio and b) does not fix the problem.

This is why it's impossible for me to have any sort of constructive
conversation with you.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/