Re: Winmodem support, some performance tradeoff estimates

Thomas Sailer (sailer@ife.ee.ethz.ch)
Mon, 17 Aug 1998 13:39:44 +0200


Oliver Xymoron wrote:

> An unrolled multiply accumulate _can_ be done in 2 clocks per argument on
> a Pentium, however (hint: the fxch instruction can be made to take 0(!!)
> clocks if ordered properly). I put together a signal processing app that
> did dot products at 45 MFLOPS on a P90 last year. But this was only if its
> working set fit within the L1 cache.

Hm? Let's see: add throughput is 1 per cycle, mul throughput is 1 per
cycle, but when do you fetch the arguments from L1 cache? Or are they
already in registers when you start your algorithm? Care to post your
actual code?
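
To make the question concrete, here is a minimal C sketch of the kind of
unrolled multiply-accumulate loop under discussion. It is not Oliver's
code, which has not been posted; the name dot_unrolled4 and the unroll
factor of four are assumptions for illustration only. Each term needs two
operand fetches (ideally from L1), one multiply and one add, and the
fetches are the part that add/mul throughput alone does not account for.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from any posted code: dot product of
 * a[0..n-1] and b[0..n-1], unrolled by four with independent accumulators.
 */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: per term, two loads, one multiply, one add. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Tail: handle the remaining 0..3 elements. */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    /* Combine the partial sums. */
    return (s0 + s1) + (s2 + s3);
}
```

The four separate accumulators are there so that consecutive adds do not
have to wait on each other's result; how the loads and any fxch get
scheduled around them is up to the compiler or the hand-written assembly.
Note that summing in this order can round differently from a plain
left-to-right loop.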

Tom
