Re: Locking L1 cache lines in Cyrix 6x86MX CPUs

André Derrick Balsa (andrebalsa@altern.org)
Wed, 20 May 1998 00:56:38 -0100

Next message: André Derrick Balsa: "Re: Locking L1 cache lines in Cyrix 6x86MX CPUs"
Previous message: Peter Horton: "Re: RPMs :o("
Maybe in reply to: André Derrick Balsa: "Locking L1 cache lines in Cyrix 6x86MX CPUs"
Next in thread: André Derrick Balsa: "Re: Locking L1 cache lines in Cyrix 6x86MX CPUs"

Hi,

Mike Jagdis wrote:
>
> On Tue, 19 May 1998, André Derrick Balsa wrote:
>
> > Hmmm. _Very_ interesting. I was thinking that perhaps the timer
> > interrupt code could be kept in such a locked cache line, because on a
> > busy machine it probably gets overwritten between the 10 ms periodic
> > interrupts. But that's a hypothesis. Nobody seems to have quantitative
> > data on this precise subject.
>
> 10ms is _forever_ in modern CPU terms :-). If the system is pretty
> much idle the timer code will probably be cached anyway. If the
> system isn't idle you need to decide whether you _really_ want to
> potentially reduce application performance just so the tick handler
> goes fast.

Setting aside 1 KB of the 64 KB L1 cache would have a maximum
theoretical impact of about 1.5% * (slowdown going from L1 cache to L2
cache), from Amdahl's Law.
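
As a rough sketch of that bound: the 6x86MX has a 64 KB unified L1 cache,
so a locked 1 KB region is 1/64 of it. The snippet below only multiplies
that fraction by an assumed L1-to-L2 slowdown factor. The function name
and the factor of 3 are made up for illustration; they are not measured
figures.

```python
L1_SIZE_BYTES = 64 * 1024   # 6x86MX unified L1 cache
LOCKED_BYTES = 1 * 1024     # region set aside for locked lines


def max_slowdown_bound(l2_vs_l1_slowdown, locked=LOCKED_BYTES, total=L1_SIZE_BYTES):
    """Upper bound on overall slowdown: (fraction of L1 given up) * (L1 -> L2 slowdown)."""
    fraction = locked / total
    return fraction * l2_vs_l1_slowdown


if __name__ == "__main__":
    fraction = LOCKED_BYTES / L1_SIZE_BYTES
    print(f"fraction of L1 locked: {fraction:.4%}")
    # 3.0 is a hypothetical slowdown factor, chosen only as an example.
    print(f"bound with 3x slowdown: {max_slowdown_bound(3.0):.4%}")
```

The bound is pessimistic: it treats every access displaced from the
locked region as one that would otherwise have hit L1.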

OTOH, since the timer interrupt service routine runs with interrupts
disabled, it's critical code. On a busy machine, keeping it in locked
cache lines would avoid refilling the L1 cache after cache misses on
every tick, which costs CPU clock cycles for both the timer routine
_and_ the applications that get interrupted.
>
> > > My own feeling is that this is not so useful as it might appear
> > > at first glance. If you _really_ want to try something interesting
> > > why not write a gcc back end that uses a locked L1 line as a nice
> > > big register file and see if you can push the x86 architecture to
> > > new heights?
> >
> > That's another very interesting possible application for the 6x86MX L1
> > cache locked lines. The x86 instruction set allows most instructions to
> > address memory instead of CPU registers, with no additional CPU clock
> > cycles.
>
> But you do have to be careful because you lose the benefits of
> register renaming and the like so you may _think_ you are doing
> well but the pipelines could be foaming like cheap lager on a
> hot day...

Very poetic image :), but I don't think that's quite the case, IMHO.
Remember the L1 cache can service both pipelines simultaneously in a
single clock cycle, since it's dual ported.
>
> > Since the L1 cache is dual ported, works at the core clock speed and has
> > no more latency than the usual x86 registers, locking a 1 KB region
> > could amount to having 256 general-purpose 32-bit registers.
> >
> > When one realizes how much gymnastics gcc is forced to do because of the
> > scarcity of registers in the x86 architecture, one begins to wonder how
> > much of a performance gain one could get with 256 more registers.
> >
> > Thanks for the tip :) Now who do I contact for more information on a
> > possible gcc x86 back end?
>
> I would start by reading the gcc source and studying the existing
> back ends for x86 and a register rich one like Alpha. Next year
> you might want to try changing a few things...
>
> I had thought that it might be possible just to have gcc use an
> explicitly locked region for temporaries. Then it occurred to me
> that temporaries will usually be pretty much clustered together
> on the same cache line anyway so there may not be that much benefit
> - except in the case where they are used either side of a function
> call or two in which case a locked scratchpad _might_ help, but then
> again the called functions may work better with the extra cache
> space...

Even if temporaries are clustered on a cache line, you can still see the
CPU swapping data around like mad between the cache and the registers,
because the generated code tries to do as much as possible inside the
registers. I am no expert in compiler techniques, but it seems gcc, like
many modern compilers, feels more comfortable on a RISC CPU with a
register-rich architecture.

As far as I know, register allocation is done at the end of the global
optimization pass, just before code generation, using a technique called
graph coloring - which works best when the CPU has at least 16 registers
available, obviously not the case with x86 machines.
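
To make the register-pressure point concrete, here is a toy sketch of
graph coloring. It is a simplified greedy version written for
illustration, not gcc's actual allocator (real Chaitin-style allocators
also handle spill costs, coalescing and so on). The function names and
the five-temporary example are hypothetical. Variables that are live at
the same time "interfere" and must get different registers; a variable
with no free register is spilled to memory.

```python
def color_registers(interference, num_regs):
    """Greedy graph coloring: map each variable to a register number, or None to spill.

    interference maps each variable to the set of variables live at the same time.
    """
    assignment = {}
    # Color the most constrained variables (highest degree) first.
    for var in sorted(interference, key=lambda v: (-len(interference[v]), v)):
        taken = {assignment[n] for n in interference[var]
                 if assignment.get(n) is not None}
        free = [r for r in range(num_regs) if r not in taken]
        assignment[var] = free[0] if free else None  # None means "spill to memory"
    return assignment


def spills(assignment):
    """Return the sorted list of variables that did not get a register."""
    return sorted(v for v, r in assignment.items() if r is None)


# Hypothetical example: five temporaries that are all live at once (a clique).
names = ["a", "b", "c", "d", "e"]
clique = {v: {w for w in names if w != v} for v in names}

if __name__ == "__main__":
    print("4 registers:", spills(color_registers(clique, 4)))
    print("16 registers:", spills(color_registers(clique, 16)))
```

With only a handful of registers, the fifth simultaneously live
temporary has nowhere to go and is spilled; with 16 registers the same
graph colors without any spills.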
>
> Head hurting yet? :-)

Not yet. :-)

Cheers,
------------------------
André Balsa
andrebalsa@altern.org
