Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds
Date: Fri Nov 23 2018 - 11:36:25 EST


On Fri, Nov 23, 2018 at 2:12 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
> Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
> generates a lot of single byte TLP.

I just tested it too - it turns out that the __inline_memcpy() code
never triggers, and "memcpy_toio()" just generates a memcpy.

So that code seems entirely dead.

And, in fact, the codebase I looked at was the historical one, because
I had been going back and looking at the history. The modern tree
*does* have the "__inline_memcpy()" function I pointed at, but it's
not actually hooked up to anything!

This actually has been broken for _ages_. The breakage goes back to
2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and
it seems nobody really ever noticed - or thought that it was ok.

That commit claims that iomem has no special significance on x86, but
that really really isn't true, exactly because the access size does
matter.

And as mentioned, the generic memory copy routines are not at all
appropriate, and that has nothing to do with ERMS. Our "do it by hand"
memory copy routine does things like this:

.Lless_16bytes:
cmpl $8, %edx
jb .Lless_8bytes
/*
* Move data from 8 bytes to 15 bytes.
*/
movq 0*8(%rsi), %r8
movq -1*8(%rsi, %rdx), %r9
movq %r8, 0*8(%rdi)
movq %r9, -1*8(%rdi, %rdx)
retq

and note how for a 8-byte copy, it will do *two* reads of the same 8
bytes, and *two* writes of the same 8 byte destination. That's
perfectly ok for regular memory, and it means that the code can handle
an arbitrary 8-15 byte copy without any conditionals or loop counts,
but it is *not* ok for iomem.

Of course, in practice it all just happens to work in almost all
situations (because a lot of iomem devices simply won't care), and
manual access to iomem is basically extremely rare these days anyway,
but it's definitely entirely and utterly broken.

End result: we *used* to do this right. For the last eight years our
"memcpy_{to,from}io()" has been entirely broken, and apparently even
the people who noticed oddities like David, never reported it as
breakage but instead just worked around it in drivers.

Ho humm.

Let me write a generic routine in lib/iomap_copy.c (which already does
the "user specifies chunk size" cases), and hook it up for x86.

David, are you using a bus analyzer or something to do your testing?
I'll have a trial patch for you asap.

Linus