Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

From: Jamie Lokier
Date: Thu Sep 04 2003 - 17:51:08 EST


bill davidsen wrote:
> | Why do you need the same piece of data mapped to multiple places
> | in the first place, and why at specific addresses? It's purely an
> | optimization of some sort, right?
>
> I think he said he was doing DSP... there's a trick of double mapping
> the same memory to save one subscript calculation in FFT (or maybe DFT)
> inner loop.

It is for DSP, but nothing to do with FFT. I hadn't ever thought of
using this techinque for FFT, and it would probably make little
difference on a modern CPU given the form of FFT algorithms.

No, I use it to make a circular buffer, in which the data always
appears as a contiguous block - no split. This is useful for
operations on streams of data, such as FIR & IIR filters, equalisers,
upconverters, downcoverters, etc. Many DSP algorithms fall into this
category.

A characteristic of these algorithms is that they consist of a long,
tight sequence of streaming memory accesses with calculations at each
step.

DSP chips often implement circular buffers by masking the offset into
the buffer's memory.

On a CPU, I prefer to avoid the masking operation which happens for
each address calculation. This saves a couple of registers, as I can
just use an incrementing pointer into the buffer, rather than a base
address, offset and mask value. Especially on x86, a couple of
registers saved is good.

It's possible to write DSP algorithms which avoid address masking,
after all a circular buffer in an ordinary array is just two separate
regions. But that complicates the algorithms especially with corner
cases, and some of them are complicated enough already.

Using the duplicate mappings, I can use the most straightforward
streaming DSP code, and it runs as fast as possible if the mappings
don't incur a penalty.

When mappings aren't available or are too slow, then I just copy the
contents of the buffer backwards whenever the write pointer will cross
the end of the array. That costs some, but keeps the DSP code simple.

Fwiw, the test program asseses whether there's a cost to using
duplicate mappings and whether they work. However, for the above kind
of DSP buffer, the measurement isn't the best it could be (although
it's what I'm using). There's a balance of factors. For a large
buffer, it's ok even if page faults were to be needed as we switch
between alias pages, because the access pattern doesn't do that very
often. Then the occasional page faults are just a potentially faster
version of the copy backwards. On the other hand, if aliased pages
are made coherent by making then uncacheable (such as the ARM port),
even though that's much faster than faulting, it isn't good for the
DSP algorithms.

Fwiw#2, in the DSP I'm working on it's better to use the copying
method for most of my buffers even on x86, because they aren't that
large and fit better into L1 cache without the mappings. Maybe for a
different project, it will get used for more of the buffers. Mainly,
having developed the testing code, I wanted to know if it worked
properly on the different architectures. It's nice to see some spin
offs, such as finding the ARM write buffer bug.

So thanks to everyone who responded. I'll post a table of the results
soon.

Thanks,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/