Re: Updated version of RD/WR FS/GS BASE patchkit

From: Andy Lutomirski
Date: Mon Mar 21 2016 - 15:41:39 EST


On Mon, Mar 21, 2016 at 9:16 AM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> This is a reworked version of my older fsgsbase patchkit.
> Main changes:
> - Ported to new entry/* code, which simplified it somewhat
> - Now has a test program
> - Fixed ptrace/core dump support
> - Better documentation
> - Some minor fixes improvement

I think that the biggest remaining issue is to define the semantics.

As an architectural matter, the relevant user state is (fs selector,
fs base, gs selector, gs base). With FSGSBASE enabled, user code can
more or less independently control all four of those values. (It's
slightly more complicated than that because set_thread_area and
modify_ldt both forget to reload segment registers IIRC, but we can
fix that independently.)

Keeping in mind that we'll probably want to add percpu segment bases
at some point (to allow very fast atomic percpu data access for user
code), the questions I have are:

1a. What happens when a task switches out and back in on the same CPU?

1b. What happens when a task switches out and back in on a different CPU?

2a. What happens when a tracer reads the state out and writes exactly
the same thing back in and the task resumes on the CPU it started on?

2b. What happens when a tracer reads the state out and writes exactly
the same thing back in and the task resumes on a different CPU?

3. What happens if fs or gs points to a real descriptor and that
descriptor changes?

4. Does the sigcontext format need to change?

For maximum safety, comprehensibility, and sanity, there's an argument
to be made that 1a and 2a should leave the state exactly as it started
and that 1b and 2b should leave it alone unless percpu bases are in
use. For maximum simplicity of implementation, there's an argument
that, if the fs or gs selector is nonzero and the base doesn't match
the in-memory descriptor, then the kernel can do whatever it wants.

I propose the following semantics:

- All "save state" or "report state" events unconditionally save the
base and selector as they actually were in the CPU state. (Keep it
simple. Also, with these patches applied, on an FSGSBASE-capable CPU,
selector != 0 is a slow path.)

- When restoring state, if selector == 0, then the base is restored as it was.

- When restoring state, if selector != 0, then the base is restored
to whatever the in-memory descriptor says. (Optionally, down the
road, we could make it so that a save + restore without an intervening
migration, set_thread_area, or modify_ldt would restore the base as it
was. This would make things more predictable.)

- If/when we add percpu bases, they are associated with a nonzero selector.
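
To make the save/restore rules above concrete, here is a minimal
user-space model in C. It is a sketch, not kernel code: struct
seg_state, seg_save(), seg_restore(), and the descriptor_base
parameter (standing in for a GDT/LDT lookup) are all hypothetical
names made up for illustration.

```c
#include <stdint.h>

/* Hypothetical model of the user-visible state of one segment (FS or GS). */
struct seg_state {
	uint16_t selector;	/* selector value held in the register */
	uint64_t base;		/* base held in the hidden part of the register */
};

/*
 * Save rule: record selector and base exactly as they are in the CPU
 * state, with no special cases.
 */
static struct seg_state seg_save(uint16_t cpu_selector, uint64_t cpu_base)
{
	struct seg_state s = { cpu_selector, cpu_base };
	return s;
}

/*
 * Restore rule: with a zero selector the saved base is restored as it
 * was; with a nonzero selector the base comes from the in-memory
 * descriptor. descriptor_base is an assumption standing in for whatever
 * the GDT/LDT entry named by the selector currently says.
 */
static struct seg_state seg_restore(struct seg_state saved,
				    uint64_t descriptor_base)
{
	struct seg_state r = saved;

	if (saved.selector != 0)
		r.base = descriptor_base;
	return r;
}
```

In this model, a base written with selector == 0 survives a
save/restore round trip unchanged, while a base written behind a
nonzero selector is replaced by the descriptor's base on restore.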

The big open question is: should signal delivery and restore do
anything to the selectors or bases? I think that, by default, they
can't, but maybe we'll want an option to do it some day.

Does all this make sense? Do people agree with me?