Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

From: Dave Hansen
Date: Mon Jun 18 2018 - 08:47:45 EST


On 06/16/2018 05:33 PM, Keno Fischer wrote:
> For my use case, it would be sufficient to simply disallow
> any value of XCR0 with "holes" in it,
But what if the hardware you are migrating to/from *has* holes?

There's no way this is even close to viable until it has been made to
cope with holes.
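
To make "holes" concrete, here is a minimal sketch of the test the proposal would need, assuming a hole means any XCR0 value whose set bits are not one contiguous run starting at bit 0. The helper name xcr0_has_holes() is hypothetical and is not part of the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: returns 1 if the set bits of
 * xcr0 are not a single contiguous run starting at bit 0.
 */
static int xcr0_has_holes(uint64_t xcr0)
{
	/*
	 * A contiguous run from bit 0 has the form 2^n - 1. Adding 1 to
	 * such a value yields a power of two that shares no bits with it.
	 */
	return (xcr0 & (xcr0 + 1)) != 0;
}
```

With this definition 0x7 (bits 0-2) has no holes, while 0xe7 (bits 0-2 and 5-7 set, bits 3-4 clear) does.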

FWIW, I just don't think this is going to be viable. I have the feeling
that there's way too much stuff that hard-codes assumptions about XCR0
inside the kernel and out. This is just going to make it much more fragile.

Folks that want this level of container migration are probably better
off running one of the hardware-based containers and migrating _those_.
Or just ensuring that the machines they want to migrate to and from
have a homogeneous XCR0 mix.

> @@ -252,6 +301,8 @@ void arch_setup_new_exec(void)
> /* If cpuid was previously disabled for this task, re-enable it. */
> if (test_thread_flag(TIF_NOCPUID))
> enable_cpuid();
> + if (test_thread_flag(TIF_MASKXCR0))
> + reset_xcr0_mask();
> }

So the mask is cleared on exec(). Does that mean that *every*
individual process using this interface has to set up its own mask
before anything in the C library establishes its cached value of XCR0?
I'd want to see how that's being accomplished.

> +static int xstate_is_initial(unsigned long mask)
> +{
> + int i, j;
> + unsigned long max_bit = __ffs(mask);
> +
> + for (i = 0; i < max_bit; ++i) {
> + if (mask & (1 << i)) {
> + char *xfeature_addr = (char *)get_xsave_addr(
> + &current->thread.fpu.state.xsave,
> + 1 << i);
> + unsigned long feature_size = xfeature_size(i);
> +
> + for (j = 0; j < feature_size; ++j) {
> + if (xfeature_addr[j] != 0)
> + return 0;
> + }
> + }
> + }
> + return 1;
> +}

There is nothing architectural saying that the init state has to be 0.
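
As an illustration, the x87 control word is 0x037f after FNINIT, so a byte-wise zero test already gives the wrong answer for that component. A minimal sketch of the alternative, comparing against a reference image of the init state instead of against zero, is below. All names are hypothetical and the buffers are plain byte arrays rather than real XSAVE areas.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: the zero test, equivalent to the inner loop of the patch. */
static int component_is_zero(const uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] != 0)
			return 0;
	}
	return 1;
}

/*
 * Hypothetical: compare a component against a saved copy of its init
 * state, which makes no assumption about the init values being 0.
 */
static int component_matches_init(const uint8_t *buf,
				  const uint8_t *init_image, size_t len)
{
	return memcmp(buf, init_image, len) == 0;
}
```

In the kernel the reference image would presumably come from something like init_fpstate, but that is an assumption about how the patch could be reworked, not something the patch does.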

> + case ARCH_SET_XCR0: {

The interface is a bit murky. The SET_XCR0 operation masks out the
"set" value from the current value? That's a bit counterintuitive.

> + unsigned long mask = xfeatures_mask & ~arg2;
> +
> + if (!use_xsave())
> + return -ENODEV;
> +
> + if (arg2 & ~xfeatures_mask)
> + return -ENODEV;

This is rather unfortunately comment-free. "Are you trying to clear a
bit that was not set in the first place?"

Also, shouldn't this be dealing with the new task->xcr0, *not* the
global xfeatures_mask? What if someone calls this more than once?
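
A small userspace model of the two behaviours, to show why the second call matters. Everything here is hypothetical: struct task_model stands in for the task, MODEL_XFEATURES_MASK for the global mask, and the argument is taken to be the requested new XCR0 value. Whether re-enabling a feature should be permitted at all is a policy question this sketch does not answer.

```c
#include <stdint.h>

/* Hypothetical global mask: x87 | SSE | AVX. */
#define MODEL_XFEATURES_MASK 0x7ULL

/* Hypothetical stand-in for a per-task XCR0 value. */
struct task_model {
	uint64_t xcr0;
};

/* Validates against the global mask, as the quoted hunk does. */
static int set_xcr0_vs_global(struct task_model *t, uint64_t requested)
{
	if (requested & ~MODEL_XFEATURES_MASK)
		return -1;
	t->xcr0 = requested;
	return 0;
}

/* Validates against what the task currently has enabled. */
static int set_xcr0_vs_task(struct task_model *t, uint64_t requested)
{
	if (requested & ~t->xcr0)
		return -1;
	t->xcr0 = requested;
	return 0;
}
```

Starting from 0x7 and requesting 0x3 and then 0x7, the global check accepts both calls and the task ends up with AVX enabled again. The per-task check rejects the second call and leaves the value at 0x3.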

> + if (!xcr0_is_legal(arg2))
> + return -EINVAL;

FWIW, I don't really get the point of disallowing some of the values
made illegal in there. Sure, you shoot yourself in the foot, but the
worst you'll probably see is a general-protection-fault from the XSETBV,
or from the first XRSTOR*. We can cope with those, and I'd rather not
be trying to keep a list of things you're not allowed to do with XSAVE.

I also don't see any sign of checking for supervisor features anywhere.

> + /*
> + * We require that any state components being disabled by
> + * this prctl be currently in their initial state.
> + */
> + if (!xstate_is_initial(mask))
> + return -EPERM;

Aside: I would *not* refer to the "initial state", for fear that we
could confuse it with the hardware-defined "init state". From software,
we really have zero control over when the hardware is in its "init state".

But, in any case, how is this supposed to work?

// get features we are disabling into values matching the
// hardware "init state".
__asm__("XRSTOR %reg1,%reg2", ...);
arch_prctl(ARCH_SET_XCR0, something);

?

That would be *really* fragile code from userspace. Adding a printk()
between those two lines would probably break it, for instance.

I'd probably just not have these checks.