Re: Linux 2.6.29

From: Chris Mason
Date: Fri Apr 03 2009 - 11:43:36 EST


On Fri, 2009-04-03 at 08:07 -0700, Linus Torvalds wrote:
>
> On Fri, 3 Apr 2009, Chris Mason wrote:
>
> > On Thu, 2009-04-02 at 20:34 -0700, Linus Torvalds wrote:
> > >
> > > Well, one rather simple explanation is that if you hadn't been doing lots
> > > of writes, then the background garbage collection on the Intel SSD gets
> > > ahead of the game, and gives you lots of bursty nice write bandwidth due
> > > to having a nicely compacted and pre-erased blocks.
> > >
> > > Then, after lots of writing, all the pre-erased blocks are gone, and you
> > > are down to a steady state where it needs to GC and erase blocks to make
> > > room for new writes.
> > >
> > > So that part doesn't surprise me per se. The Intel SSDs definitely
> > > fluctuate a bit timing-wise (but I love how they never degenerate to the
> > > "ooh, that _really_ sucks" case that the other SSDs and the rotational
> > > media I've seen do when you do random writes).
> > >
> >
> > 23MB/s seems a bit low, though; I'd try with O_DIRECT. ext3 doesn't do
> > writepages, and the SSD may be very sensitive to smaller writes (what
> > brand?)
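
Something like this is what I mean -- just a sketch, with the file name,
buffer size, and write count made up. The point is that O_DIRECT bypasses
the page cache entirely (and needs an aligned buffer), so you see what the
device itself can sustain:

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const size_t bufsz = 1024 * 1024;	/* 1MB per write */
		void *buf;
		int fd, i;

		/* the test file is assumed to already exist */
		fd = open("testfile", O_WRONLY | O_DIRECT);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* O_DIRECT wants the buffer (and the IO size/offset)
		 * aligned to the logical block size; 4096 is safe. */
		if (posix_memalign(&buf, 4096, bufsz)) {
			fprintf(stderr, "posix_memalign failed\n");
			return 1;
		}
		memset(buf, 0, bufsz);

		for (i = 0; i < 128; i++) {	/* overwrite 128MB */
			if (write(fd, buf, bufsz) != (ssize_t)bufsz) {
				perror("write");
				return 1;
			}
		}
		close(fd);
		free(buf);
		return 0;
	}

If it still does 23MB/s with O_DIRECT, the device itself is the limit; if
the numbers jump, the page cache writeback pattern is the problem.
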
>
> I didn't realize that Jeff had a non-Intel SSD.
>
> THAT sure explains the huge drop-off. I do see Intel SSDs fluctuating
> too, but the Intel ones tend to be _fairly_ stable.

Even the Intel ones have cliffs for long-running random IO workloads
(where the bottom of the cliff is still very fast), but a streaming
overwrite like this one should be stable.

>
> > > The fact that it also happens for the regular disk does imply that it's
> > > not the _only_ thing going on, though.
> >
> > Jeff, if you blktrace it I can make up a seekwatcher graph. My bet is
> > that pdflush is stuck writing the indirect blocks, and doing a ton of
> > seeks.
> >
> > You could change the overwrite program to also do sync_file_range on the
> > block device ;)
>
> Actually, that won't help. 'sync_file_range()' works only on the virtually
> indexed page cache, and I think ext3 uses "struct buffer_head *" for all
> its metadata updates (due to how JBD works). So sync_file_range() will do
> nothing at all to the metadata, regardless of what mapping you execute it
> on.

The buffer heads do end up on the block device inode's pages, and ext3
is letting pdflush do some of the writeback. It's hard to say if
sync_file_range is going to help; the IO on the metadata may be random
enough for that SSD that it won't really matter who writes it or when.
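
For what it's worth, this is roughly what I meant -- a sketch only, with
the device path made up: open the block device itself and kick off
writeback on its page cache, which is where those buffer heads sit.

	#define _GNU_SOURCE		/* for sync_file_range() */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* device path is an assumption; use whatever the fs is on */
		int fd = open("/dev/sdb", O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* offset 0, nbytes 0 means "through the end": start
		 * writeback on all dirty pages of the block device inode,
		 * without waiting for it to complete. */
		if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE)) {
			perror("sync_file_range");
			return 1;
		}
		close(fd);
		return 0;
	}

Whether kicking it ourselves beats waiting for pdflush is exactly the
open question.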

Spinning disks might suck, but at least they all suck in the same
way... tuning for all these different SSDs isn't going to be fun at all.

-chris

