RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight
Date: Mon Nov 26 2018 - 05:12:20 EST


From: Andy Lutomirski
> Sent: 23 November 2018 19:11
> > On Nov 23, 2018, at 11:44 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >>
> >> What is memcpy_to_io even supposed to do? I'm guessing it's defined as
> >> something like "copy this data to IO space using at most long-sized writes,
> >> all aligned, and writing each byte exactly once, in order."
> >> That sounds... dubiously useful.
> >
> > We've got hundreds of users of it, so it's fairly common..
>
> I'm wondering if the "at most long-sizes" restriction matters, especially
> given that we're apparently accessing some of the same bytes more than once.
> I would believe that trying to encourage 16-byte writes (with AVX, ugh) or
> 64-byte writes (with MOVDIR64B) would be safe and could meaningfully speed
> up some workloads.

The real gains come from increasing the width of IO reads, not IO writes.
None of the x86 cpus I've got issue multiple concurrent PCIe reads
(the PCIe completion tag seems to match the core number).
PCIe writes are all 'posted' so there aren't big gaps between them.

> >> I could see a function that writes to aligned memory in specified-sized chunks.
> >
> > We have that. It's called "__iowrite{32,64}_copy()". It has very few users.

For x86 you want separate entry points: one for the 'rep movsq' copy
and one using an instruction loop
(perhaps with guidance on the cutover length).
In most places the driver will know whether the size is above or below
the cutover - which might be 256.
Certainly transfers below 64 bytes are 'short'.
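
A rough userspace sketch of that split, for illustration only. The function
names and IO_COPY_CUTOVER are hypothetical, not existing kernel interfaces,
and 256 is just the "might be" figure from above. It copies ordinary memory
in 8-byte words rather than real MMIO, and falls back to the plain loop when
not built for x86-64 with GCC or clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutover in bytes; the right value would need measuring. */
#define IO_COPY_CUTOVER 256

/* Entry point for short transfers: a plain loop of 64-bit stores.
 * 'count' is the number of 8-byte words. */
static inline void io_copy_words_loop(volatile uint64_t *dst,
                                      const uint64_t *src, size_t count)
{
    while (count--)
        *dst++ = *src++;
}

/* Entry point for long transfers: 'rep movsq'.
 * Relies on the direction flag being clear, as the x86-64 ABI requires. */
static inline void io_copy_words_rep(volatile uint64_t *dst,
                                     const uint64_t *src, size_t count)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     :
                     : "memory");
#else
    /* Not x86-64: no 'rep movsq', so reuse the loop. */
    io_copy_words_loop(dst, src, count);
#endif
}

/* Convenience wrapper for callers that do not know the size up front. */
static inline void io_copy_words(volatile uint64_t *dst,
                                 const uint64_t *src, size_t count)
{
    /* Both pointers are expected to be 8-byte aligned. */
    assert(((uintptr_t)dst & 7) == 0);
    assert(((uintptr_t)src & 7) == 0);

    if (count * sizeof(uint64_t) >= IO_COPY_CUTOVER)
        io_copy_words_rep(dst, src, count);
    else
        io_copy_words_loop(dst, src, count);
}
```

A driver that always moves, say, a 16-byte descriptor would call
io_copy_words_loop() directly, and one that always moves a 4k buffer would
call io_copy_words_rep(); only callers with a genuinely variable length
need the wrapper and its compare.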

David
