Re: [RFC] syscall calling convention, stts/clts, and xstate latency

From: Ingo Molnar
Date: Mon Jul 25 2011 - 02:39:45 EST

Next message: Neha Singhal: "Crash while using completions"
Previous message: Marcel Selhorst: "Re: [PATCH v2] char/tpm: Add new driver for Infineon I2C TIS TPM"
In reply to: Ingo Molnar: "Re: [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to"
Next in thread: Andrew Lutomirski: "Re: [RFC] syscall calling convention, stts/clts, and xstate latency"

* Andrew Lutomirski <luto@xxxxxxx> wrote:

> On Sun, Jul 24, 2011 at 5:15 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
> >
> > * Andrew Lutomirski <luto@xxxxxxx> wrote:
> >
> >> I was trying to understand the FPU/xstate saving code, and I ran
> >> some benchmarks with surprising results. These are all on Sandy
> >> Bridge i7-2600. Please take all numbers with a grain of salt --
> >> they're in tight-ish loops and don't really take into account
> >> real-world cache effects.
> >>
> >> A clts/stts pair takes about 80 ns. Accessing extended state from
> >> userspace with TS set takes 239 ns. A kernel_fpu_begin /
> >> kernel_fpu_end pair with no userspace xstate access takes 80 ns
> >> (presumably 79 of those 80 are the clts/stts). (Note: The numbers
> >> in this paragraph were measured using a hacked-up kernel and KVM.)
> >>
> >> With nonzero ymm state, xsave + clflush (on the first cacheline of
> >> xstate) + xrstor takes 128 ns. With hot cache, xsave = 24ns,
> >> xsaveopt (with unchanged state) = 16 ns, and xrstor = 40 ns.
> >>
> >> With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38
> >> ns and xsaveopt saves another 5 ns.
> >>
> >> Zeroing the state completely with vzeroall adds 2 ns. Not sure
> >> what's going on.
> >>
> >> All of this makes me think that, at least on Sandy Bridge, lazy
> >> xstate saving is a bad optimization -- if the cache is being nice,
> >> save/restore is faster than twiddling the TS bit. And the cost of
> >> the trap when TS is set blows everything else away.
> >
> > > Interesting. Mind cooking up a delazying patch and measuring it on
> > > native as well? KVM generally makes exceptions more expensive, so the
> > > effect of lazy exceptions might be less on native.
>
> Using the same patch on native, I get:
>
> kernel_fpu_begin/kernel_fpu_end (no userspace xstate): 71.53 ns
> stts/clts: 73 ns (clearly there's a bit of error here)
> userspace xstate with TS set: 229 ns
>
> So virtualization adds only a little bit of overhead.

KVM rocks.

> This isn't really a delazying patch -- it's two arch_prctls, one of
> them is kernel_fpu_begin();kernel_fpu_end(). The other is the same
> thing in a loop.
>
> The other numbers were already native since I measured them
> entirely in userspace. They look the same after rebooting.

I should have mentioned it earlier, but there are a number of
delazying patches in the tip:x86/xsave branch:

$ gll linus..x86/xsave
300c6120b465: x86, xsave: fix non-lazy allocation of the xsave area
f79018f2daa9: Merge branch 'x86/urgent' into x86/xsave
66beba27e8b5: x86, xsave: remove lazy allocation of xstate area
1039b306b1c6: x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)
4182a4d68bac: x86, xsave: add support for non-lazy xstates
324cbb83e215: x86, xsave: more cleanups
2efd67935eb7: x86, xsave: remove unused code
0c11e6f1aed1: x86, xsave: cleanup fpu/xsave signal frame setup
7f4f0a56a7d3: x86, xsave: rework fpu/xsave support
26bce4e4c56f: x86, xsave: cleanup fpu/xsave support

it's not in tip:master because the LWP bits need (much) more work to
be palatable - but we could spin them off and complete them as per
your suggestions if they are an independent speedup on modern CPUs.

> >> Which brings me to another question: what do you think about
> >> declaring some of the extended state to be clobbered by syscall?
> >> Ideally, we'd treat syscall like a regular function and clobber
> >> everything except the floating point control word and mxcsr. More
> >> conservatively, we'd leave xmm and x87 state but clobber ymm. This
> >> would let us keep the cost of the state save and restore down when
> >> kernel_fpu_begin is used in a syscall path and when a context
> >> switch happens as a result of a syscall.
> >>
> >> glibc does *not* mark the xmm registers as clobbered when it issues
> >> syscalls, but I suspect that everything everywhere that issues
> >> syscalls does it from a function, and functions are implicitly
> >> assumed to clobber extended state. (And if anything out there
> >> assumes that ymm state is preserved, I'd be amazed.)
> >
> > To build the kernel with sse optimizations? Would certainly be
> > interesting to try.
>
> I had in mind something a little less ambitious: making
> kernel_fpu_begin very fast, especially when used more than once.
> Currently it's slow enough to have spawned arch/x86/crypto/fpu.c,
> which is a hideous piece of infrastructure that exists solely to
> reduce the number of kernel_fpu_begin/end pairs when using AES-NI.
> Clobbering registers in syscall would reduce the cost even more,
> but it might require having a way to detect whether the most recent
> kernel entry was via syscall or some other means.
>
> Making the whole kernel safe for xstate use would be technically
> possible, but it would add about three cycles to syscalls (for
> vzeroall -- non-AVX machines would take a larger hit) and
> apparently about 57 ns to non-syscall traps. That seems worse than
> the lazier approach.

3 cycles per syscall is acceptable, if the average optimization
savings per syscall are better than 3 cycles - which is not
impossible at all: using more registers generally moves the pressure
away from GP registers and allows the compiler to be smarter.

(older CPUs with higher switching costs wouldn't want to run such
kernels, obviously.)

So it's very much worth trying, if only to get some hard numbers.

That would also turn the somewhat awkward way we use vector
operations in the crypto code into something more natural. In theory
you could write a crypto algorithm in C and the compiler would use
vector instructions and get a pretty good end result. (one can always
hope, right?)

But more importantly, doing that would push vector operations *way*
beyond the somewhat niche area of crypto/RAID optimizations.
User-space already saves/restores the vector registers, so it has
already paid much of the register switching cost - the kernel just
has to take advantage of that.

Thanks,

Ingo
