Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes

From: Chris Mason
Date: Tue May 20 2008 - 13:09:48 EST

Next message: Johannes Berg: "Re: [RFC] make wext wireless bits optional and deprecate them"
Previous message: Louis Rilling: "Re: [RFC][PATCH 0/3] configfs: Make nested default groups lockdep-friendly"
In reply to: Jamie Lokier: "Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes"
Next in thread: Jamie Lokier: "Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tuesday 20 May 2008, Jamie Lokier wrote:
> Chris Mason wrote:
> > > You don't need the barrier after in some cases, or it can be deferred
> > > until a better time. E.g. when the disk write cache is probably empty
> > > (some time after write-idle), barrier flushes may take the same time
> > > as NOPs.
> >
> > I hesitate to get too fancy here, if the disk is idle we probably
> > won't notice the performance gain.
>
> I think you're right, but it's hard to be sure. One of the problems
> with barrier-implemented-as-flush-all is that it flushes data=ordered
> data, even when that's not wanted, and there can be a lot of data in
> the disk's write cache, spread over many seeks.

Jens and I talked about tossing the barriers completely and just doing FUA for
all metadata writes. For drives with NCQ, we'll get something close to
optimal because the higher layer elevators are already doing most of the hard
work.

Either way, you do want the flush to cover all the data=ordered writes, at
least all the ordered writes from the transaction you're about to commit.
Telling data=ordered writes from an old transaction apart from those of the
running transaction gets into pushing ordering down to the lower levels
(see below).

>
> Then it's good to delay barrier-flushes to batch metadata commits, but
> good to issue the barrier-flushes prior to large batches of
> data=ordered data, so the latter can survive in the disk write
> cache for seek optimisations with later requests which aren't yet
> known.
>
> All this sounds complicated at the JBD layer, and IMHO much simpler at
> the request elevator layer.
>
> > But, it complicates the decision about when you're allowed to dirty
> > a metadata block for writeback. It used to be dirty-after-commit
> > and it would change to dirty-after-barrier. I suspect that is some
> > significant surgery into jbd.
>
> Rather than tracking when it's "allowed" to dirty a metadata block, it
> will be simpler to keep a flag saying "barrier needed", and just issue
> the barrier prior to writing a metadata block, if the flag is set.
>
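
To make the quoted "barrier needed" flag concrete, here is a toy sketch in
Python. It is not jbd code: the ToyJournal class, its method names and the
block numbers are made up for illustration, and the "disk" is just a list
that records the order in which operations are issued.

```python
class ToyJournal:
    """Toy model of a deferred barrier; names here are illustrative only."""

    def __init__(self):
        self.barrier_needed = False
        self.disk_ops = []  # order in which operations reach the "disk"

    def commit(self, txn_id):
        # Write the commit block, but defer the barrier that follows it.
        self.disk_ops.append(("commit", txn_id))
        self.barrier_needed = True

    def write_metadata(self, block_nr):
        # Issue the owed barrier only when metadata is about to be written.
        if self.barrier_needed:
            self.disk_ops.append(("barrier", None))
            self.barrier_needed = False
        self.disk_ops.append(("metadata", block_nr))


journal = ToyJournal()
journal.commit(1)
journal.write_metadata(100)
journal.write_metadata(101)
print(journal.disk_ops)
```

In this model the barrier lands between the commit and the first metadata
write, and the second metadata write does not pay for another one.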

Adding explicit ordering into the IO path is really interesting. We toss a
bunch of IO down to the lower layers with information about dependencies and
let the lower layers figure it out. James had a bunch of ideas here, but I'm
afraid the only people that understood it were James and the whiteboard he
was scribbling on.

The trick is to code the ordering in such a way that an IO failure breaks the
chain, and that the filesystem has some sensible chance to deal with all
these requests that have failed because an earlier write failed.
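
A rough Python sketch of what "an IO failure breaks the chain" could look
like. Everything here is hypothetical (the Request class, run_chain(), the
request names and the status strings); it only shows that a request is
cancelled, not written, once anything it depends on did not complete.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    name: str
    depends_on: list = field(default_factory=list)  # names of earlier requests
    will_fail: bool = False                         # simulate a media error


def run_chain(requests):
    """Process requests in submission order and return {name: status}.

    Assumes every dependency was submitted before the request that needs it.
    Status is 'ok', 'failed' (the IO itself failed) or 'cancelled' (never
    issued because a dependency did not complete).
    """
    status = {}
    for req in requests:
        if any(status.get(dep) != "ok" for dep in req.depends_on):
            status[req.name] = "cancelled"
        elif req.will_fail:
            status[req.name] = "failed"
        else:
            status[req.name] = "ok"
    return status


result = run_chain([
    Request("data"),
    Request("log_blocks", will_fail=True),
    Request("commit", depends_on=["data", "log_blocks"]),
    Request("metadata", depends_on=["commit"]),
])
print(result)
```

Here the failed log write cancels both the commit block and the metadata
write behind it, and the filesystem gets a status for each request to act on.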

Also, once we go down the ordering road, it is really tempting to forget that
ordering does ensure consistency but doesn't tell us the write actually
happened. fsync and friends need to hook into the dependency chain to wait
for the barrier instead of waiting for the commit.
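
The gap between "ordered" and "actually on disk" can be shown with a small
toy model. ToyDisk and toy_fsync() are invented names for this sketch and
have nothing to do with the real fsync path; the cache is assumed to keep
writes in order but to lose them on power failure until flush() runs.

```python
class ToyDisk:
    """Disk with a volatile write cache; purely illustrative."""

    def __init__(self):
        self.cache = []  # accepted in order, but lost on power failure
        self.media = []  # would survive power failure

    def write(self, name):
        self.cache.append(name)

    def flush(self):
        # Move everything in the cache to stable media, preserving order.
        self.media.extend(self.cache)
        self.cache.clear()

    def is_durable(self, name):
        return name in self.media


def toy_fsync(disk, commit_name):
    # Ordering alone is not enough: wait for the flush, not just the commit.
    if not disk.is_durable(commit_name):
        disk.flush()
    return disk.is_durable(commit_name)


disk = ToyDisk()
disk.write("data")
disk.write("commit")
print(disk.is_durable("commit"))   # ordered, but still only in the cache
print(toy_fsync(disk, "commit"))   # durable once the flush has completed
```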

But, back to the short term for a second, what we need are some benchmarks for
barriers on and off and some guidance from the ext3/4 maintainers about
turning them on by default. We shouldn't be pushing this FS integrity
decision off on the distros.

My test prog is definitely a worst case, but I'm pretty confident that most
mail server workloads end up doing similar IO.

A 16MB or 32MB disk cache is common these days, and that is a very sizable
percentage of the jbd log size. I think the potential for corruptions on
power failure is only growing over time.
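
For a feel of the numbers, a quick back-of-the-envelope calculation in
Python. The 16MB and 32MB cache sizes are the ones mentioned above; the
journal sizes (32MB and 128MB) are assumptions picked for illustration, so
plug in the real log size of the filesystem in question.

```python
def cache_share_of_journal(cache_mb, journal_mb):
    """Disk write cache size as a percentage of the journal size."""
    return 100.0 * cache_mb / journal_mb


for journal_mb in (32, 128):      # assumed journal sizes, not measured
    for cache_mb in (16, 32):     # cache sizes from the text above
        share = cache_share_of_journal(cache_mb, journal_mb)
        print(f"{cache_mb}MB cache / {journal_mb}MB journal = {share:.1f}%")
```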

-chris