The SMP FPU code is subtly different as it saves the FPU state on all context
switches and "knows" the FPU won't be used early in the system startup before
the processes are set up right. I guess that is what causes his lockups. The
other item is that a kernel process cannot sleep while doing an FPU copy, as
it might wake up on the _other_ processor.
Have you looked at using the integer unit to asynchronously touch cache lines
ahead of the FPU, btw?
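
Roughly what I mean by touching lines ahead, as a userspace sketch rather than
kernel code. The 32 byte line size, the TOUCH_AHEAD distance and the function
name are all assumptions for illustration, and memcpy() just stands in for
the FPU copy of one line:

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE  32  /* assumed line size (Pentium class), not measured */
#define TOUCH_AHEAD 4   /* hypothetical distance, in lines, to touch ahead */

/*
 * Copy n bytes one cache line at a time. Before each line is copied, a
 * plain integer load touches a source line TOUCH_AHEAD lines further on,
 * so that line is (hopefully) already in cache when the copy reaches it.
 */
static void copy_touch_ahead(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    while (n - off >= CACHE_LINE) {
        size_t ahead = off + (size_t)TOUCH_AHEAD * CACHE_LINE;

        /* volatile read so the compiler cannot drop the touch */
        if (ahead < n)
            (void)*(const volatile unsigned char *)(s + ahead);

        /* stand-in for the FPU copy of the current line */
        memcpy(d + off, s + off, CACHE_LINE);
        off += CACHE_LINE;
    }

    /* tail shorter than one line */
    memcpy(d + off, s + off, n - off);
}
```

Whether this wins anything depends on the touch distance and on the load
not stalling the copy itself, so it would need timing on real hardware.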
Now we need FPU copy/checksum 8). On a more serious note, the next generation
Intel CPUs with the 57 new "multimedia" instructions like dot product will
let us do even better block copies and also copy/csums, looking at the Intel
blurb.
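
For anyone not following the copy/csum idea: the point is to sum the data in
the same pass that copies it, so it is only read once. A plain integer
sketch, assuming the usual 16 bit ones' complement Internet checksum (my
assumption here, as is the function name). This is not the kernel routine
and uses no FPU or multimedia instructions:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n bytes from src to dst and return the 16 bit ones' complement
 * checksum of the data, computed during the same pass. Bytes are summed
 * as big-endian 16 bit words; an odd trailing byte is padded with zero.
 */
static uint16_t copy_and_csum(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint32_t sum = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1];
        sum = (sum & 0xffff) + (sum >> 16);   /* fold the carry back in */
    }

    if (i < n) {                              /* odd trailing byte */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
        sum = (sum & 0xffff) + (sum >> 16);
    }

    return (uint16_t)~sum;
}
```

A wider unit would do the same thing several words at a time, which is
where the gain would come from.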
Alan