Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

From: Ingo Molnar
Date: Thu Mar 22 2018 - 05:33:57 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> And the real worry is things like AVX-512 etc, which is exactly when
> things like "save and restore one ymm register" will quite likely
> clear the upper bits of the zmm register.

Yeah, I think the only valid save/restore pattern is to 100% correctly enumerate
the width of the vector registers, and use full width instructions.

Using partial registers, even though it's possible in some cases, is probably a
bad idea, not just because most instructions auto-zero the upper portion to
reduce false dependencies, but also because 'mixed' use of partial and full
register access is known to result in penalties on a wide range of Intel CPUs,
at least according to the Agner PDFs. On AMD CPUs there's no penalty.

So what I think could be done at best is to define a full register save/restore
API, which falls back to XSAVE*/XRSTOR* if we don't have the routines for the
native vector register width. (I.e. if an old kernel is used on a very new CPU.)

Note that the actual AVX code could still use partial width, it's the save/restore
primitives that have to handle full width registers.

> And yes, we can have some statically patched code that takes that into account,
> and saves the whole zmm register when AVX512 is on, but the whole *point* of the
> dynamic XSAVES thing is actually that Intel wants to be able enable new
> user-space features without having to wait for OS support. Literally. That's why
> and how it was designed.

This aspect wouldn't be hurt AFAICS: to me it appears that due to glibc using
vector instructions in its memset() the AVX bits get used early on and to the
maximum, so the XINUSE for them is set for every task.

The optionality of other XSAVE based features like MPX wouldn't be hurt if the
kernel only uses vector registers.

> And saving a couple of zmm registers is actually pretty hard. They're big. Do
> you want to allocate 128 bytes of stack space, preferably 64-byte aligned, for a
> save area? No. So now it needs to be some kind of per-thread (or maybe per-CPU,
> if we're willing to continue to not preempt) special save area too.

Hm, that's indeed a nasty complication:

- While a single 128-byte slot might work, in practice at least two vector
registers are needed to have enough parallelism and hide latencies.

- &current->thread.fpu.state.xsave is available almost all the time: with our
current 'direct' FPU context switching code the only time there's live data in
&current->thread.fpu is when the task is not running. But it's not IRQ-safe.

We could probably allow irq save/restore sections to use it, as
local_irq_save()/restore() is still *much* faster than a 1-1.5K FPU context
save/restore pattern.

But I was hoping for a less restrictive model ... :-/

To have a better model and avoid the local_irq_save()/restore we could perhaps
change the IRQ model to have a per IRQ 'current' value (we have separate IRQ
stacks already), but that's quite a bit of work to transform all code that
operates on the interrupted task (scheduler and timer code).

But it's work that would be useful for other reasons as well.

With such a separation in place &current->thread.fpu.state.xsave would become a
generic, natural vector register save area.

> And even then, it doesn't solve the real worry of "maybe there will be odd
> interactions with future extensions that we don't even know of".

Yes, that's true, but I think we could avoid these dangers by using CPU model
based enumeration. The cost would be that vector ops would only be available on
new CPU models after an explicit opt-in. In many cases it will be a single new
constant to an existing switch() statement, easily backported as well.
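
Roughly what such an opt-in could look like, as a userspace C sketch. The
function name vector_ops_opted_in() and the choice of which models are listed
are hypothetical; the model numbers are the family 6 Skylake values, used here
only as examples:

```c
#include <stdbool.h>

/* Family 6 model numbers, listed only as examples. */
#define MODEL_SKYLAKE_MOBILE   0x4E
#define MODEL_SKYLAKE_X        0x55
#define MODEL_SKYLAKE_DESKTOP  0x5E

/*
 * Hypothetical opt-in check: kernel vector ops are enabled only on CPU
 * models that were explicitly listed. Unknown (future) models get 'false'
 * until someone adds a new case label.
 */
static bool vector_ops_opted_in(unsigned int family, unsigned int model)
{
    if (family != 6)
        return false;

    switch (model) {
    case MODEL_SKYLAKE_MOBILE:
    case MODEL_SKYLAKE_DESKTOP:
    case MODEL_SKYLAKE_X:
        return true;
    default:
        return false;
    }
}
```

Enabling a new model is then a one-line change, which is what would keep the
backports trivial.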

> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?

Ok, so that's not what I'd use it for, I'd use it:

- Speed up existing AVX (crypto, RAID) routines for smaller buffer sizes.
Right now the XSAVE*+XRSTOR* cost is significant:

x86/fpu: Cost of: XSAVE insn: 104 cycles
x86/fpu: Cost of: XRSTOR insn: 80 cycles

... and that's with just 128-bit AVX and a ~0.8K XSAVE area. The Agner PDF
lists Skylake XSAVE+XRSTOR costs at 107+122 cycles, plus there's probably a
significant amount of L1 cache churn caused by XSAVE/XRSTOR.

Most of the relevant vector instructions have a single cycle cost
on the other hand.
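
To put the overhead into perspective, here is a back-of-the-envelope
break-even calculation in C. The helper break_even_bytes() is hypothetical, and
the throughput numbers used with it (8 bytes/cycle scalar, 32 bytes/cycle
vector) are illustrative assumptions, not measurements. It solves
n/scalar = overhead + n/vector for n:

```c
#include <stdint.h>

/*
 * Buffer size (in bytes) at which "overhead + n / vec_bpc" cycles equals
 * "n / scalar_bpc" cycles, rounded up. Requires vec_bpc > scalar_bpc.
 *
 *   n = overhead * scalar_bpc * vec_bpc / (vec_bpc - scalar_bpc)
 */
static uint64_t break_even_bytes(uint64_t overhead_cycles,
                                 uint64_t scalar_bpc, uint64_t vec_bpc)
{
    uint64_t num = overhead_cycles * scalar_bpc * vec_bpc;
    uint64_t den = vec_bpc - scalar_bpc;

    return (num + den - 1) / den;   /* round up */
}
```

Under those assumed rates, break_even_bytes(104 + 80, 8, 32) gives 1963 bytes
and break_even_bytes(107 + 122, 8, 32) gives 2443 bytes, i.e. buffers smaller
than roughly 2K would not recoup the save/restore cost.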

- To use vector ops in bulk, well-aligned memcpy(), which in many workloads
is a fair chunk of all memcpy() activity. A usage profile on a typical system:

galatea:~> cat /proc/sched_debug | grep hist | grep -E '[[:digit:]]{4,}$' | grep '0\]'
hist[0x0000]: 1514272
hist[0x0010]: 1905248
hist[0x0020]: 99471
hist[0x0030]: 343309
hist[0x0040]: 177874
hist[0x0080]: 190052
hist[0x00a0]: 5258
hist[0x00b0]: 2387
hist[0x00c0]: 6975
hist[0x00d0]: 5872
hist[0x0100]: 3229
hist[0x0140]: 4813
hist[0x0160]: 9323
hist[0x0200]: 12540
hist[0x0230]: 37488
hist[0x1000]: 17136
hist[0x1d80]: 225199

The first column is the length of the area copied, the second column is the
usage count.

To do this I wouldn't complicate the main memcpy()/memset() interfaces in any
way to branch them off to vector ops, I'd isolate specific memcpy()s and
memset()s (such as page table copying and page clearing) and use the simpler
vector register based primitives there.

For example we have clear_page(), which is used by GFP_ZERO and by other places,
and which is implemented on modern x86 CPUs as:

ENTRY(clear_page_erms)
movl $4096,%ecx
xorl %eax,%eax
rep stosb
ret

While for such buffer sizes the enhanced-REP string instructions are supposed
to be slightly faster than 128-bit AVX ops, for such exact page granular ops
I'm pretty sure 256-bit (and 512-bit) vector ops are faster.
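
A userspace sketch of what a 256-bit variant might look like, using AVX
intrinsics. clear_page_sketch() and clear_page_avx256() are hypothetical names,
the page is assumed to be at least 32-byte aligned, and this ignores the
kernel-side register save/restore entirely. Whether it actually beats
'rep stosb' would have to be measured:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define CLEAR_SKETCH_X86 1
#endif

#define CLEAR_SKETCH_PAGE_SIZE 4096

#ifdef CLEAR_SKETCH_X86
/* Zero one page with aligned 256-bit stores, 64 bytes per iteration. */
__attribute__((target("avx")))
static void clear_page_avx256(void *page)
{
    const __m256i zero = _mm256_setzero_si256();
    char *p = page;
    size_t off;

    for (off = 0; off < CLEAR_SKETCH_PAGE_SIZE; off += 64) {
        _mm256_store_si256((__m256i *)(p + off), zero);
        _mm256_store_si256((__m256i *)(p + off + 32), zero);
    }
}
#endif

/* Uses the AVX path when the CPU has it, plain memset() otherwise. */
static void clear_page_sketch(void *page)
{
    assert(((uintptr_t)page % 32) == 0);   /* aligned stores need this */
#ifdef CLEAR_SKETCH_X86
    if (__builtin_cpu_supports("avx")) {
        clear_page_avx256(page);
        return;
    }
#endif
    memset(page, 0, CLEAR_SKETCH_PAGE_SIZE);
}
```

Two stores per iteration keep two independent vector operations in flight,
in line with the "at least two registers" observation earlier.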

- For page granular memset/memcpy it would also be interesting to investigate
whether non-temporal, cache-preserving vector ops for such known-large bulk
ops, such as VMOVNTDQA, are beneficial in certain circumstances.

On x86 there's only a single non-temporal instruction to GP registers:
MOVNTI, and for stores only.

The vector instructions space is a lot richer in that regard, allowing
non-temporal loads as well which utilize fill buffers to move chunks of memory
into vector registers.

Random example: in do_cow_fault() we use copy_user_highpage() to copy the page,
which uses copy_user_page() -> copy_page(), which uses:

ENTRY(copy_page)
ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
movl $4096/8, %ecx
rep movsq
ret

But in this COW copy case it's pretty obvious that we shouldn't keep the
_source_ page in cache. So we could use non-temporal loads, which appear to make
a difference on more recent uarchs even on write-back memory ranges:

https://stackoverflow.com/questions/40096894/do-current-x86-architectures-support-non-temporal-loads-from-normal-memory

See the final graph in that entry and how the 'NT load' variant results in the
best execution time in the 4K case - and this is a limited benchmark that
doesn't measure the lower cache eviction pressure by NT loads.

( The store part is probably better done into the cache, not just due to the
SFENCE cost (which is relatively low at 40 cycles), but because it's probably
beneficial to prime the cache with a freshly COW-ed page, it's going to get
used in the near future once we return from the fault. )

etc.
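
The COW copy idea above (non-temporal loads from the source page, regular
cached stores to the destination) could be sketched in userspace C like this.
copy_page_sketch() and copy_page_nt_load_avx2() are hypothetical names, both
pages are assumed to be at least 32-byte aligned, and the AVX2 intrinsic
_mm256_stream_load_si256() is used for the non-temporal loads. On write-back
memory the non-temporal hint may or may not be honored, depending on the uarch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define COPY_SKETCH_X86 1
#endif

#define COPY_SKETCH_PAGE_SIZE 4096

#ifdef COPY_SKETCH_X86
/* NT loads from the source, ordinary (cache-priming) stores to the dest. */
__attribute__((target("avx2")))
static void copy_page_nt_load_avx2(void *dst, const void *src)
{
    const char *s = src;
    char *d = dst;
    size_t off;

    for (off = 0; off < COPY_SKETCH_PAGE_SIZE; off += 64) {
        __m256i a = _mm256_stream_load_si256((const __m256i *)(s + off));
        __m256i b = _mm256_stream_load_si256((const __m256i *)(s + off + 32));

        _mm256_store_si256((__m256i *)(d + off), a);
        _mm256_store_si256((__m256i *)(d + off + 32), b);
    }
}
#endif

/* Uses the AVX2 path when available, plain memcpy() otherwise. */
static void copy_page_sketch(void *dst, const void *src)
{
    assert(((uintptr_t)dst % 32) == 0 && ((uintptr_t)src % 32) == 0);
#ifdef COPY_SKETCH_X86
    if (__builtin_cpu_supports("avx2")) {
        copy_page_nt_load_avx2(dst, src);
        return;
    }
#endif
    memcpy(dst, src, COPY_SKETCH_PAGE_SIZE);
}
```

Since the stores are regular ones, no SFENCE is needed after the loop.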

- But more broadly, if we open up vector ops for smaller buffer sizes as well
then I think other kernel code would start using them too:

- I think the BPF JIT, whose byte code machine language is used by an
increasing number of kernel subsystems, could benefit from having vector ops.
It would possibly allow the handling of floating point types.

- We could consider implementing vector ops based copy-to-user and copy-from-user
primitives as well, for cases where we know that the dominant usage pattern is
for larger, well-aligned chunks of memory.

- Maybe we could introduce a floating point library (which falls back to a C
implementation) and simplify scheduler math. We go to ridiculous lengths to
maintain precision across a wide range of parameters, essentially implementing
128-bit fixed point math. Even 32-bit floating point math would possibly be
better than that, even if it was done via APIs.

etc.: I think the large vector processor available in modern x86 CPUs could be
utilized by the kernel as well for various purposes.

But I think that's only worth doing if vector registers and their save areas are
easily accessible and the accesses are fundamentally IRQ safe.

> Yes, yes, I can find an Intel white-paper that talks about setting WC and then
> using xmm and ymm instructions to write a single 64-byte burst over PCIe, and I
> assume that is where the code and idea came from. But I don't actually see any
> reason why a burst of 8 regular quad-word bytes wouldn't cause a 64-byte burst
> write too.

Yeah, I'm not too convinced about the wide readq/writeq usecase either, I just
used the opportunity to outline these (very vague) plans about utilizing vector
instructions more broadly within the kernel.

> So as far as I can tell, there are basically *zero* upsides, and a lot of
> potential downsides.

I agree about the potential downsides and I think most of them can be addressed
adequately - and I think my list of upsides above is potentially significant,
especially once we have lightweight APIs to utilize individual vector registers
without having to do a full save/restore of the ~1K vector register context.

Thanks,

Ingo