Re: [PATCH 2/9] x86: Add support for rd/wr fs/gs base

From: Andy Lutomirski
Date: Mon Mar 21 2016 - 15:22:52 EST


On Mon, Mar 21, 2016 at 12:05 PM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> On Mon, Mar 21, 2016 at 11:13:05AM -0700, Andy Lutomirski wrote:
>> On Mar 21, 2016 9:16 AM, "Andi Kleen" <andi@xxxxxxxxxxxxxx> wrote:
>> >
>> > From: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>> >
>> > Introduction:
>> >
>> > IvyBridge added four new instructions to directly read and write the
>> > fs and gs 64bit base registers. Previously this had to be done with a system
>> > call to write to MSRs. The main use case is fast user space threading
>> > and switching the fs/gs registers quickly there. Another use
>> > case is having (relatively) cheap access to a new address
>> > register per thread.
>> >
>> > The instructions are opt-in and have to be explicitly enabled
>> > by the OS.
>> >
>> > For more details on how to use the instructions see
>> > Documentation/x86/fsgs.txt added in a follow-on patch.
>> >
>> > Paranoid exception path changes:
>> > ===============================
>> >
>> > The paranoid entry/exit code is used for any NMI like
>> > exception.
>> >
>> > Previously Linux couldn't support the new instructions
>> > because the paranoid entry code relied on the gs base never being
>> > negative outside the kernel to decide when to use swapgs. It would
>> > check the gs MSR value and assume it was already running in
>> > kernel if negative.
>> >
>> > To get rid of this assumption we have to revamp the paranoid exception
>> > path to not rely on this. We can use the new instructions
>> > to get (relatively) quick access to the GS value, and use
>> > it directly to save/restore the GSBASE instead of using
>> > SWAPGS.
>> >
>> > This is also significantly faster than an MSR read, so it will speed
>> > up NMIs (useful for profiling).
>> >
>> > The kernel gs for the paranoid path is now stored at the
>> > bottom of the IST stack (so that it can be derived from RSP).
>> >
>> > The original patch compared the gs with the kernel gs and
>> > assumed that if it was identical, swapgs was not needed
>> > (and no user space processing was needed). This
>> > was nice and simple and didn't need a lot of changes.
>> >
>> > But this had the side effect that if a user process set its
>> > GS to the same as the kernel it may lose rescheduling
>> > checks (so a racing reschedule IPI would only have been
>> > acted upon at the next non-paranoid interrupt).
>> >
>> > This version now switches to full save/restore of the GS.
>> >
>> > When swapgs used to be needed, but we have the new
>> > instructions, we restore original GS value in the exit
>> > path.
>> >
>> > Context switch changes:
>> > ======================
>> >
>> > Then after these changes we need to also use the new instructions
>> > to save/restore fs and gs, so that the new values set by the
>> > users won't disappear. This is also significantly
>> > faster for the case when the 64bit base has to be switched
>> > (that is when GS is larger than 4GB), as we can replace
>> > the slow MSR write with a faster wr[fg]sbase execution.
>> >
>> > This in turn enables fast switching when there are
>> > enough threads that their TLS segment does not fit below 4GB
>> > (or with some newer systems which don't properly hint the
>> > stack limit). Alternatively, programs that use fs as an additional base
>> > register will not get a significant context switch penalty.
>> >
>> > It is all done in a single patch because there was no
>> > simple way to do it in pieces without having crash
>> > holes in between.
>> >
>> > v2: Change to save/restore GS instead of using swapgs
>> > based on the value. Large scale changes.
>> > v3: Fix wrong flag initialization in fallback path.
>> > Thanks 0day!
>> > v4: Make swapgs code paths kprobes safe.
>> > Port to new base line code which now switches indexes.
>> > v5: Port to new kernel which avoids paranoid entry for ring 3.
>> > Removed some code that handled this previously.
>> > v6: Remove obsolete code. Use macro for ALTERNATIVE. Use
>> > ALTERNATIVE for exit path, eliminating the DO_RESTORE_G15 flag.
>> > Various cleanups. Improve description.
>> > v7: Port to new entry code. Some fixes/cleanups.
>> > v8: Lots of changes.
>> > Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>> > ---
>> > arch/x86/entry/entry_64.S | 31 +++++++++++++++++++++++++++
>> > arch/x86/kernel/cpu/common.c | 9 ++++++++
>> > arch/x86/kernel/process_64.c | 51 ++++++++++++++++++++++++++++++++++++++------
>> > 3 files changed, 85 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> > index 858b555..c605710 100644
>> > --- a/arch/x86/entry/entry_64.S
>> > +++ b/arch/x86/entry/entry_64.S
>> > @@ -35,6 +35,8 @@
>> > #include <asm/asm.h>
>> > #include <asm/smap.h>
>> > #include <asm/pgtable_types.h>
>> > +#include <asm/alternative-asm.h>
>> > +#include <asm/fsgs.h>
>> > #include <linux/err.h>
>> >
>> > /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
>> > @@ -678,6 +680,7 @@ ENTRY(\sym)
>> > jnz 1f
>> > .endif
>> > call paranoid_entry
>> > + /* r15: previous gs if FSGSBASE, otherwise %ebx: swapgs flag */
>>
>> [...]
>>
>> The asm looks generally correct.
>>
>> > @@ -1422,8 +1425,14 @@ void cpu_init(void)
>> > */
>> > if (!oist->ist[0]) {
>> > char *estacks = per_cpu(exception_stacks, cpu);
>> > + void *gs = per_cpu(irq_stack_union.gs_base, cpu);
>> >
>> > for (v = 0; v < N_EXCEPTION_STACKS; v++) {
>> > + /* Store GS at bottom of stack for bootstrap access */
>> > + *(void **)estacks = gs;
>> > + /* Put it on every 4K entry */
>> > + if (exception_stack_sizes[v] > EXCEPTION_STKSZ)
>> > + *(void **)(estacks + EXCEPTION_STKSZ) = gs;
>>
>> What if it's more than 2x the normal size?
>
> Well it is not and cannot be. Is that a trick question?

It isn't, but I had to look at the header to find that out.
Presumably either the code should work no matter what the stack sizes
are or it should assert that the sizes are always either
EXCEPTION_STKSZ or 2*EXCEPTION_STKSZ.
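
A minimal userspace sketch of the first option, not kernel code and not
part of the patch: it stores the GS value at the bottom of every
EXCEPTION_STKSZ-sized chunk, so any multiple of EXCEPTION_STKSZ works.
The helper names, the 4096 value and the offset-based lookup are
assumptions made for illustration; the real entry code would presumably
mask RSP directly, which also requires the stacks to be aligned to
EXCEPTION_STKSZ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EXCEPTION_STKSZ 4096UL	/* assumed chunk size for this sketch */

/*
 * Hypothetical helper: store gs at the bottom of every
 * EXCEPTION_STKSZ-sized chunk of one stack, whatever multiple of
 * EXCEPTION_STKSZ the total size is.
 */
static void store_gs_in_chunks(char *stack_bottom, size_t stack_size, void *gs)
{
	size_t off;

	/* The scheme only works if the stack is a whole number of chunks. */
	assert(stack_size % EXCEPTION_STKSZ == 0);

	for (off = 0; off < stack_size; off += EXCEPTION_STKSZ)
		memcpy(stack_bottom + off, &gs, sizeof(gs));
}

/* Hypothetical helper: read back the copy in the chunk that contains sp. */
static void *load_gs_for_sp(char *stack_bottom, const char *sp)
{
	size_t off = (size_t)(sp - stack_bottom) & ~(EXCEPTION_STKSZ - 1);
	void *gs;

	memcpy(&gs, stack_bottom + off, sizeof(gs));
	return gs;
}
```

The second option would keep the two stores from the patch and add a
build-time check that no exception stack is larger than
2*EXCEPTION_STKSZ.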

--Andy