Re: [Patch v5 11/16] x86/speculation: Add Spectre v2 app to app protection modes

From: Andrea Arcangeli
Date: Mon Nov 19 2018 - 16:40:18 EST


On Mon, Nov 19, 2018 at 08:39:41PM +0100, Jiri Kosina wrote:
> On Mon, 19 Nov 2018, Andrea Arcangeli wrote:
>
> > Generally speaking the untrusted code that would try to use spectrev2
> > to attack the other processes is more likely to run inside SECCOMP
> > jail than outside, so if SECCOMP should be used as a best effort
> > heuristic to decide when to enable STIBP, it would make more sense to
> > enable STIBP outside SECCOMP, and not inside. I.e. the exact opposite
> > of what you're proposing above.
>
> Hmm, that's a very good point. But I actually don't see why both
> directions wouldn't be possible real-blackhat-world scenarios. So perhaps
> we'd want, under the basic asumption that "SECCOMP should really be
> sandboxed from outisde interventions and from causing them from inside as
> well", flush on both switch-to-seccomp and switch-from-seccomp?

STIBP doesn't flush, so I don't see how "flush" and "switch" fit the
STIBP discussion.

Flush as in IBPB on switch-to-seccomp and switch-from-seccomp? IBPB is
not going to solve the HT attack and STIBP is only about the HT
attack. IBPB only solves the user-to-user context switch attack.
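
To make the distinction concrete, here is a minimal sketch in C of
which mitigation covers which attack path, as described in the
paragraph above. The enum and function names are made up for
illustration and are not kernel API.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only (not kernel API). */
enum v2_attack {
	ATTACK_CONTEXT_SWITCH,	/* user-to-user, across a context switch */
	ATTACK_HT_SIBLING,	/* attacker runs on the sibling hyperthread */
};

enum v2_mitigation {
	MITIGATION_IBPB,	/* a barrier: flushes when it is issued */
	MITIGATION_STIBP,	/* a mode: no flush, isolates the siblings */
};

/* Returns true if mitigation m addresses attack path a. */
static bool mitigates(enum v2_mitigation m, enum v2_attack a)
{
	switch (m) {
	case MITIGATION_IBPB:
		return a == ATTACK_CONTEXT_SWITCH;
	case MITIGATION_STIBP:
		return a == ATTACK_HT_SIBLING;
	}
	return false;
}
```

So a "flush on switch" policy can only ever be built from IBPB, and it
says nothing about the sibling thread case that STIBP is for.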

I just don't see SECCOMP as a good fit for a default-on heuristic
because there would be more arguments to enable STIBP outside seccomp
than inside and even if you ignore that, SECCOMP is used by pretty
much everything including wrapping through containers and systemd so
it would still leave lots of software running with STIBP (and for all
the wrong reasons too).

By contrast, not dumpable was a much better fit for a per-process
enablement heuristic, because not dumpable code is more likely to be
the code that needs protection from attack, and less likely to be the
very malicious code that got exploited (or was untrusted to begin
with, like DRM binary blobs or public cloud usages). However, as
mentioned in this thread, suid calls can set the not dumpable flag, so
it's not ideal either. We'd need to track which processes explicitly
made themselves not dumpable with SUID_DUMP_DISABLE.
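
For reference, the explicit opt-in from userspace looks roughly like
the sketch below. prctl(PR_SET_DUMPABLE) and prctl(PR_GET_DUMPABLE)
are the real interfaces. SUID_DUMP_DISABLE is a kernel-internal name
that userspace headers don't export, so it is redefined locally here,
and make_not_dumpable() is just a name picked for this example.

```c
#include <sys/prctl.h>

/*
 * Kernel-internal constant, not exported to userspace headers.
 * Redefined here with the kernel's value so the call reads the same.
 */
#define SUID_DUMP_DISABLE 0

/*
 * Explicitly mark the calling process as not dumpable.
 * Returns the resulting PR_GET_DUMPABLE value (0 on success),
 * or -1 if either prctl() call fails.
 */
static int make_not_dumpable(void)
{
	if (prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

PR_GET_DUMPABLE returns the same value whether the process asked for
it this way or ended up not dumpable as a side effect of a suid exec,
which is why the explicit case would need separate tracking.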

> So if I understand you correctly, what you are proposing here is to keep
> the current code, but just switch the default, and make it
> runtime/boottime togglable?

Deciding the default on this stuff is nightmarish: there's no good
default, and the best system-wide default is data and workload
dependent.

And this is precisely why this should be runtime togglable and not just
boot-time togglable in my view.

I don't disagree with default disabled. That may be safer, to avoid
breaking workloads near full capacity (the same reason HT isn't
disabled by default for L1TF). We have to draw a line somewhere with
the default.

The ASLR argument from Tim's patchset cover letter combined with PID
namespaces should go a long way to mitigate the HT attack too even
without STIBP.

In my understanding you need to know what's running on the sibling
thread to derandomize ASLR. Otherwise you'd potentially be attacking
glibc or some other lib that is indeed mapped by all processes, but
not at the same address in all of them. You need to restrict the
measurement during ASLR derandomization to the exact time the target
process is running on the sibling (any thread of the process would
do).

Now, assuming there's no pid namespace that prevents you from seeing
what's running on the sibling thread, how hard it is to derandomize
ASLR depends on the scheduling jitter and on the size and hw hashfn of
the BTB (which varies across CPUs). According to the original paper,
some non-Linux OS leaves many of the low-order bits of its ASLR
unrandomized, and the high bits don't go into the hashfn of the BTB
(incidentally, the ASLR derandomization technique to attack userland
was apparently not tested on Linux). We should be randomizing all bits
down to bit 12 (not bit 15), so for us the derandomizing should be 256
times more expensive? (At least until the day we map .text into
filesystem THP pagecache...) The more randomized bits that are part of
the BTB hashfn input, the more computationally expensive it becomes to
derandomize ASLR, and the more the random scheduling jitter will
interfere with the longer measurement. So hopefully the complexity of
the attack grows more than linearly with the number of random ASLR low
bits that get into the BTB hashfn input. This is an optimistic guess
though.
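
As a back-of-the-envelope check on the scaling, here is a small sketch
under the simplest possible assumption: the attacker's cost is
proportional to the number of candidate values, and every additional
randomized bit feeds the BTB hashfn. Both assumptions and the function
name are mine, not taken from the paper.

```c
#include <assert.h>

/*
 * Factor by which the brute-force search space grows when the lowest
 * randomized ASLR bit moves from old_low_bit down to new_low_bit.
 * Assumes a cost linear in the number of candidates, so this is a
 * lower bound if the real cost grows faster than linearly.
 */
static unsigned long long aslr_search_factor(unsigned int old_low_bit,
					     unsigned int new_low_bit)
{
	assert(old_low_bit >= new_low_bit);
	assert(old_low_bit - new_low_bit < 64);
	return 1ULL << (old_low_bit - new_low_bit);
}
```

With this model, aslr_search_factor(15, 12) is 8, i.e. three extra
bits. A factor of 256 would correspond to eight extra randomized bits,
for example aslr_search_factor(20, 12).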

Overall, for on-prem cloud usages where no random malicious code can
run on the CPU by design, and this is only a post-exploitation
robustness issue, it doesn't seem a major concern if STIBP is disabled,
as long as pid namespaces and ASLR have been fully leveraged by default
in Kubernetes containers. I'm curious to hear other people's opinions
on this too, however.

Downstream we have always provided ibrs_enabled=2/3, which already
implies STIBP is enabled at all times and, unlike STIBP alone, also
protects host userland against guest attacks within the same context.
It's tunable at runtime. It's not enabled by default, for
considerations similar to those above for STIBP. I think it's good to
give users the choice to be 100% secure against everything as an
opt-in (ideally requiring a reboot, that actually helps the evaluation
of the performance impact, which is obviously workload dependent too).

As an alternative to STIBP it would also be possible to alter the
scheduler so it never runs different processes on different siblings
of the same core unless they can ptrace each other (the exact same
ptrace check as the one used to decide whether to run IBPB to protect
against the context switch spectrev2 attack, except that here it needs
to be checked in both directions). This way you could still have zero
penalty for all kernel builds, all threaded programs, etc., while
still retaining full security against the HT attack (and IBPB takes
care of the rest with the same ptrace check). With several
containerized single threaded workloads it would be slower than STIBP
though, because it leaves all siblings idle. However, to take care of
that detail, we could also let any process under a pid namespace run
alongside any other process under a pid namespace even if they cannot
ptrace each other. Not sure if it's worth it, but it remains a
possibility that may perform better than STIBP. It would also take
care of cache attacks on non-constant time algorithms etc. in general,
not just the spectrev2 HT attack.
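
To illustrate the policy, here is a toy sketch of the pairing
decision. Nothing in it is kernel code: struct toy_task,
toy_may_ptrace() and may_share_core() are hypothetical, and the uid
comparison is only a crude stand-in for the real ptrace access check.

```c
#include <stdbool.h>

/* Toy task model; every field and function here is hypothetical. */
struct toy_task {
	unsigned int uid;	/* stand-in for the real ptrace credentials */
	bool in_pid_ns;		/* runs inside a non-initial pid namespace */
};

/* Grossly simplified stand-in for the kernel's ptrace access check. */
static bool toy_may_ptrace(const struct toy_task *tracer,
			   const struct toy_task *target)
{
	return tracer->uid == 0 || tracer->uid == target->uid;
}

/* May a and b run at the same time on siblings of the same core? */
static bool may_share_core(const struct toy_task *a,
			   const struct toy_task *b)
{
	/* Unlike the IBPB case, the check must hold in both directions. */
	if (toy_may_ptrace(a, b) && toy_may_ptrace(b, a))
		return true;
	/* Optional relaxation: namespaced tasks may pair with each other. */
	return a->in_pid_ns && b->in_pid_ns;
}
```

In this model a root task and an unprivileged task don't pair up,
because the unprivileged one cannot ptrace root, while two threads of
the same user always do.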

Thanks,
Andrea