Re: [beta patch] SSE copy_page() / clear_page()

From: Andrew Morton
Date: Fri Feb 16 2001 - 10:27:02 EST

Manfred Spraul wrote:
> Intel Pentium III and P 4 have hardcoded "fast stringcopy" operations
> that invalidate whole cachelines during write (documented in the most
> obvious place: multiprocessor management, memory ordering)

Which are dramatically slower than a simple `mov' loop for just
about all alignments, except for source and dest both eight-byte
aligned.

For example, copying an uncached source to an uncached dest,
with the source misaligned, my PIII Coppermine does 108 MBytes/sec
with `rep;movsl' and 149 MBytes/sec with an open-coded variant
of our copy_csum routines. That's a lot: about 38% faster.
Similar results on a PII and a PIII Katmai.

On the K6-2, however, the string operation is almost always
a win.

It seems that a good approximation for our bulk-copy strategy is:

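        if (AMD) {
                /* string operations (rep;movsl) */
        } else if (intel) {
                if ((source|dest) & 7)
                        /* open-coded `mov' loop */
                else
                        /* string operations */
        } else {
                /* other vendors: not measured here */
        }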
This will make our Intel copies 20-40% faster than
at present, depending upon the distribution of
alignments. (And for networking, the distribution
is pretty much uniform).
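
A minimal user-space sketch of that selection logic, for illustration
only. The names cpu_vendor, copy_fn, copy_string, copy_mov_loop and
pick_copy are hypothetical, memcpy() stands in for the `rep;movsl'
string copy, and the fallback for other vendors is an assumption,
since only Intel and K6-2 numbers are reported above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical vendor tags; real code would derive this from CPUID. */
enum cpu_vendor { VENDOR_AMD, VENDOR_INTEL, VENDOR_OTHER };

typedef void (*copy_fn)(void *dst, const void *src, size_t len);

/* Stand-in for the `rep;movsl' string copy (assumption: plain memcpy). */
static void copy_string(void *dst, const void *src, size_t len)
{
        memcpy(dst, src, len);
}

/* Open-coded loop: one 32-bit load and store per step, then a byte tail. */
static void copy_mov_loop(void *dst, const void *src, size_t len)
{
        unsigned char *d = dst;
        const unsigned char *s = src;

        while (len >= 4) {
                uint32_t word;

                memcpy(&word, s, sizeof(word)); /* alignment-safe load */
                memcpy(d, &word, sizeof(word)); /* alignment-safe store */
                s += 4;
                d += 4;
                len -= 4;
        }
        while (len--)
                *d++ = *s++;
}

/* Choose a copy routine from the vendor and the pointer alignment. */
static copy_fn pick_copy(enum cpu_vendor vendor,
                         const void *dst, const void *src)
{
        if (vendor == VENDOR_AMD)
                return copy_string;

        if (vendor == VENDOR_INTEL) {
                /* Either pointer off an eight-byte boundary: avoid string ops. */
                if (((uintptr_t)src | (uintptr_t)dst) & 7)
                        return copy_mov_loop;
                return copy_string;
        }

        /* Assumption: nothing measured for other vendors, so use the default. */
        return copy_string;
}
```

A caller would do pick_copy(vendor, dst, src)(dst, src, len). Choosing
per call rather than once at boot is needed here because the Intel
branch depends on the alignment of each individual copy.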

Somewhere on my to-do list is getting lots of people to
test lots of architectures with lots of combinations of
[source/dest][cached/uncached] at lots of alignments
to confirm if this will work.
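
A rough sketch of what such a test sweep might look like in user
space. Everything named here (copy_under_test, sweep_alignments,
measure_mbytes_per_sec, BUF_LEN, MAX_ALIGN) is hypothetical, memcpy()
is only a placeholder for the routine being tested, and the
cached/uncached dimension is not covered because user space cannot
control it portably. Offsets are relative to the buffer start, which
is assumed but not guaranteed to be eight-byte aligned.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BUF_LEN   4096
#define MAX_ALIGN 8

typedef void (*copy_fn)(void *dst, const void *src, size_t len);

static unsigned char src_buf[BUF_LEN + MAX_ALIGN];
static unsigned char dst_buf[BUF_LEN + MAX_ALIGN];
static volatile unsigned char sink; /* keeps the timed copies observable */

/* Routine under test; memcpy is a placeholder (assumption). */
static void copy_under_test(void *dst, const void *src, size_t len)
{
        memcpy(dst, src, len);
}

/*
 * Run fn for every (source, dest) offset pair in 0..MAX_ALIGN-1 and
 * compare the result byte for byte. Returns the number of bad pairs.
 */
static int sweep_alignments(copy_fn fn)
{
        int failures = 0;

        for (size_t i = 0; i < sizeof(src_buf); i++)
                src_buf[i] = (unsigned char)(i * 31 + 7);

        for (size_t s = 0; s < MAX_ALIGN; s++) {
                for (size_t d = 0; d < MAX_ALIGN; d++) {
                        memset(dst_buf, 0, sizeof(dst_buf));
                        fn(dst_buf + d, src_buf + s, BUF_LEN);
                        if (memcmp(dst_buf + d, src_buf + s, BUF_LEN) != 0)
                                failures++;
                }
        }
        return failures;
}

/*
 * Time reps copies at one offset pair and return MBytes/sec, taking
 * 1 MByte as 10^6 bytes. Returns 0.0 if the run was too short to time.
 */
static double measure_mbytes_per_sec(copy_fn fn, size_t s_off,
                                     size_t d_off, int reps)
{
        clock_t start = clock();
        double secs;

        for (int r = 0; r < reps; r++) {
                fn(dst_buf + d_off, src_buf + s_off, BUF_LEN);
                sink = dst_buf[d_off];
        }
        secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        return secs > 0.0 ? (double)BUF_LEN * reps / 1e6 / secs : 0.0;
}
```

With a 4 KB buffer everything stays in cache, so the figures from this
sketch would only correspond to the cached/cached case.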

If you have time, could you please grab

and teach it how to do SSE copies, in preparation for this
great event?

