Re: [rfc][patch] store-free path walking

From: Jens Axboe
Date: Thu Oct 08 2009 - 09:31:42 EST


On Thu, Oct 08 2009, Nick Piggin wrote:
> On Thu, Oct 08, 2009 at 02:57:46PM +0200, Jens Axboe wrote:
> > On Thu, Oct 08 2009, Nick Piggin wrote:
> > > On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote:
> > > > On Wed, 7 Oct 2009, Nick Piggin wrote:
> > > > >
> > > > > OK, I have a really basic patch that does store-free path walking
> > > > > (except on the final element).
> > > >
> > > > Yay!
> > > >
> > > > > dbench is pretty nasty still because it seems to do a lot of stupid
> > > > > things like reading from /proc/mounts all the time.
> > > >
> > > > You should largely forget about dbench, it can certainly be a useful
> > > > benchmark, but at the same time it's certainly not a _meaningful_ one.
> > > > There are better things to try.
> > >
> > > OK, here's one you might find interesting. It is a cached git diff
> > > workload in a linux kernel tree. I actually ran it in a loop 100
> > > times in order to get some reasonable sample sizes, then I ran
> > > parallel and serial configs (PreloadIndex = true/false). Compared
> > > plain kernel with all vfs patches to now.
> > >
> > > 2.6.32-rc3 serial
> > > 5.35user 7.12system 0:12.47elapsed 100%CPU
> > >
> > > 2.6.32-rc3 parallel
> > > 5.79user 17.69system 0:09.41elapsed 249%CPU
> > >
> > > vfs serial
> > > 5.30user 5.62system 0:10.92elapsed 100%CPU
> > >
> > > vfs parallel
> > > 4.86user 0.68system 0:06.82elapsed 81%CPU
> >
> > Since the box was booted anyway, I tried the git test too. Results are
> > with 2.6.32-rc3 serial being the baseline 1.00 scores, smaller than 1.00
> > are faster and vice versa.
> >
> > 2.6.32-rc3 serial
> > real 1.00
> > user 1.00
> > sys 1.00
> >
> > 2.6.32-rc3 parallel
> > real 0.80
> > user 0.83
> > sys 8.38
> >
> > sys time, ouch...
> >
> > vfs serial
> > real 0.86
> > user 0.93
> > sys 0.84
>
> This is actually nice too. My tests were on a 2s8c Barcelona system,
> but this is showing we have a nice serial win on Nehalem as well.
> Actually K8 CPUs have a bit faster lock primitives than earlier
> Intel CPUs I think (closer to Nehalem), so we might see an even
> bigger win with a Core2.

Yes, this is just as interesting as the parallel results imho. I don't
have a Core 2 test box, so I cannot test that.

> > vfs parallel
> > real 0.43
> > user 0.72
> > sys 0.13
> >
> > Let me know if you want profiles or anything like that. I'd say that
> > looks veeeery tasty.
>
> It doesn't look all that different to mine, so profiles probably
> not required at this point. Is the CPU accounting going wrong? It
> looks like thread times are not being accumulated back properly,
> which IIRC they should be. But 'real' time should be accurate, so
> it is going a lot faster.

Yes, it looks very similar. The higher CPU count just makes the parallel
git preload on stock -rc3 look even more crappy when compared to the
serialized approach (you had roughly 2x the sys time, I have roughly
8x).
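
For anyone who wants to put the two sets of numbers on the same scale,
here is a minimal sketch of the normalization used above, applied to the
raw timings Nick posted earlier in the thread. The helper name
`normalize` and the dictionary layout are just illustrative choices, not
part of any tool used for these runs; the only real data are the
user/system/elapsed figures quoted above. With the -rc3 serial run as
the 1.00 baseline, Nick's -rc3 parallel sys time comes out at about
2.48, which is the "roughly 2x" referred to here.

```python
# Raw timings in seconds, copied from Nick Piggin's results quoted above.
NICK_TIMES = {
    "2.6.32-rc3 serial":   {"real": 12.47, "user": 5.35, "sys": 7.12},
    "2.6.32-rc3 parallel": {"real": 9.41,  "user": 5.79, "sys": 17.69},
    "vfs serial":          {"real": 10.92, "user": 5.30, "sys": 5.62},
    "vfs parallel":        {"real": 6.82,  "user": 4.86, "sys": 0.68},
}


def normalize(times, baseline="2.6.32-rc3 serial"):
    """Divide every timing by the baseline run's value, so baseline == 1.00.

    Hypothetical helper for illustration only. Values below 1.00 are
    faster than the baseline, values above 1.00 are slower.
    """
    base = times[baseline]
    return {
        name: {key: round(value / base[key], 2) for key, value in run.items()}
        for name, run in times.items()
    }


if __name__ == "__main__":
    for name, run in normalize(NICK_TIMES).items():
        print(name)
        for key in ("real", "user", "sys"):
            print(f"  {key:4s} {run[key]:.2f}")
```

The same function can be pointed at any other set of raw timings, as
long as one run is picked as the baseline.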

IIRC, there was a bug with thread accounting very recently. Why would it
not hit -rc3 alone, though? It does look fishy.

--
Jens Axboe
