Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver

From: Alexander Graf
Date: Mon Dec 07 2020 - 09:23:20 EST




On 27.11.20 21:20, Jann Horn wrote:

On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin
<acatan@xxxxxxxxxx> wrote:
On 27/11/2020 20:22, Jann Horn wrote:
On Fri, Nov 20, 2020 at 11:29 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
<acatan@xxxxxxxxxx> wrote:
This patch is a driver that exposes a monotonic incremental Virtual
Machine Generation u32 counter via a char-dev FS interface that
provides sync and async VmGen counter update notifications. It also
provides VmGen counter retrieval and confirmation mechanisms.

The hw provided UUID is not exposed to userspace, it is internally
used by the driver to keep accounting for the exposed VmGen counter.
The counter starts from zero when the driver is initialized and
monotonically increments every time the hw UUID changes (the VM
generation changes).

On each hw UUID change, the new hypervisor-provided UUID is also fed
to the kernel RNG.
As for v1:

Is there a reasonable usecase for the "confirmation" mechanism? It
doesn't seem very useful to me.

I think it adds value in complex scenarios with multiple users of the
mechanism, potentially at varying layers of the stack, different
processes and/or runtime libraries.

The driver offers a natural place to handle minimal orchestration
support and offer visibility in system-wide status.

A high-level service that trusts all system components to properly use
the confirmation mechanism can actually block and wait patiently for the
system to adjust to the new world. Even if it doesn't trust all
components it can still do a best-effort, timeout block.

What concrete action would that high-level service be able to take
after waiting for such an event?

For us, it would only allow incoming requests to the target container after the container has successfully adjusted.

You can think of other models too. For example, your container orchestration engine could prevent network traffic from reaching the container applications until the full container is adjusted.

My model of the vmgenid mechanism is that RNGs and cryptographic
libraries that use it need to be fundamentally written such that it is
guaranteed that a VM fork can not cause the same random number /
counter / ... to be reused for two different cryptographic operations
in a way visible to an attacker. This means that e.g. TLS libraries
need to, between accepting unencrypted input and sending out encrypted
data, check whether the vmgenid changed since the connection was set
up, and if a vmgenid change occurred, kill the connection.

Can you give a concrete example of a usecase where the vmgenid
mechanism is used securely and the confirmation mechanism is necessary
as part of that?

The main crux here is that we have 2 fundamental approaches:

1) Transactional

For an RNG, the natural place to adjust yourself to a resumed snapshot is in the random number retrieval. You just check if your generation is still identical when you fetch the next random number.

Ideally, you also do the same for anything consuming such a random number. So random number retrieval would no longer just return ( number ), but instead ( number, generation ). That way you could check at every consumer side of the random number that it's actually still random. That would need to cascade down.

So every key you derive from a random number, every UUID you generate, would need to store the generation as well and compare whether the current generation is still the same as when it was generated. That means you need to convert every data access method into a function call that checks if the value is still consumable and, if not, is able to regenerate it. The same applies to global values, such as a system global UUID that is shared among multiple processes.
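
To make that cost concrete, here is a rough Python sketch of such a generation-tagged value. Nothing in it comes from the patch: `GenerationTagged`, `read_generation` and the fake counter are hypothetical names, and the generation source is just a callable standing in for whatever interface a stack would really use.

```python
import uuid
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class GenerationTagged(Generic[T]):
    """A value that remembers which VM generation it was created in.

    Hypothetical helper for illustration, not part of the driver.
    """

    def __init__(self, read_generation: Callable[[], int], make: Callable[[], T]):
        self._read_generation = read_generation  # assumed generation source
        self._make = make                        # recreates the value on demand
        self._refresh()

    def _refresh(self) -> None:
        # Retry until the generation is the same before and after creating
        # the value, so the stored tag matches the value.
        while True:
            generation = self._read_generation()
            value = self._make()
            if self._read_generation() == generation:
                self._generation, self._value = generation, value
                return

    def get(self) -> T:
        # Every access is a function call that validates the tag first.
        if self._read_generation() != self._generation:
            self._refresh()
        return self._value


# Stand-in for the real counter; a snapshot resume would bump it.
fake_generation = [0]
system_uuid = GenerationTagged(lambda: fake_generation[0], uuid.uuid4)

before = system_uuid.get()
same = system_uuid.get()      # generation unchanged, value is reused
fake_generation[0] += 1       # simulate a resumed snapshot
after = system_uuid.get()     # value is regenerated
```

Note that the check only protects the moment of `get()`; whoever consumes the returned value has to carry the generation along as well, which is exactly the cascading described above.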

If you slowly move away from super integrated environments like a TLS library and start thinking of Samba system UUIDs or SSH host keys, you'll quickly see how that approach reaches its limits.


2) Event based

Let's take a look at the complicated things to implement with the transactional approach: samba system UUIDs, SSH host keys, global variables in a random Java application that get initialized on application start.

All of these are very easy to resolve through an event based mechanism. Based on the "new generation" event, you can just generate a new UUID. Or a new host key. For this to be non-racy, all you need to know is that the target services have adjusted themselves before you actually use them. In most container workloads, that can be achieved by not letting network traffic in before the event is fully processed.
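
As a rough illustration of the event based model, a service could look something like the Python sketch below. All names (`GenerationEvents`, `may_accept_traffic`, the `state` dict) are made up for this example, and how the "new generation" event actually gets delivered is left out on purpose.

```python
import uuid
from typing import Callable, List


class GenerationEvents:
    """Runs registered handlers on a "new generation" event (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[int], None]] = []
        self.adjusted_generation = 0  # last generation we fully adjusted to

    def subscribe(self, handler: Callable[[int], None]) -> None:
        self._handlers.append(handler)

    def on_new_generation(self, generation: int) -> None:
        for handler in self._handlers:
            handler(generation)
        # Only after every handler has run do we count as adjusted.
        self.adjusted_generation = generation

    def may_accept_traffic(self, current_generation: int) -> bool:
        # The gate that keeps this non-racy: no traffic until adjusted.
        return self.adjusted_generation == current_generation


state = {"system_uuid": uuid.uuid4()}
events = GenerationEvents()
events.subscribe(lambda generation: state.update(system_uuid=uuid.uuid4()))

old_uuid = state["system_uuid"]
# Snapshot resumed as generation 1, event not processed yet: keep traffic out.
blocked = events.may_accept_traffic(current_generation=1)
events.on_new_generation(1)
open_again = events.may_accept_traffic(current_generation=1)
```

The application code that reads `state["system_uuid"]` stays untouched; only the gate in front of the service has to know about generations.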


What this patch set does is provide both: We allow the transactional approach to be implemented through mmap() of a shared page, for stacks where that's easiest. You can use that when your logic is realistically convertible to transactional. We also allow for an asynchronous event, which can be used in environments where the transactional approach is hard because of design constraints (language, API, system, etc.).

Combining the two, you get the best of both worlds IMHO.


How do you envision integrating this with libraries that have to work
in restrictive seccomp sandboxes? If this was in the vDSO, that would
be much easier.

Since this mechanism targets the whole userspace stack, the use cases
vary greatly. I doubt we can have a single silver bullet interface.

For example, the mmap interface targets user space RNGs, where as fast
and as race free as possible is key. But there are also higher level
applications that don't manage their own memory or don't have access to
low-level primitives so they can't use the mmap or even vDSO interfaces.
That's what the rest of the logic is there for, the read+poll interface
and all of the orchestration logic.

Are you saying that, because people might not want to write proper
bindings for this interface while also being unwilling to take the
performance hit of calling read() in every place where they would have
to do so to be fully correct, you want to build a "best-effort"
mechanism that is deliberately designed to allow some cryptographic
state reuse in a limited time window?

I seriously hope that for crypto, we will always(?) be able to use the transactional approach. And there we don't even have to resort to read() - you can just mmap() the generation ID.
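
For illustration, reading a generation counter from a mapped page could look roughly like the Python sketch below. The layout (a native-endian u32 at offset 0 of the page) and the idea of passing a device path are assumptions made for this example; the demo maps a temporary file as a stand-in for the device.

```python
import mmap
import os
import struct
import tempfile


def map_generation(path: str) -> mmap.mmap:
    """Map one read-only shared page of `path` (assumed to hold the counter)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
    finally:
        os.close(fd)  # the mapping keeps its own reference


def read_generation(page: mmap.mmap) -> int:
    # Assumption: the counter is a native-endian u32 at offset 0.
    return struct.unpack_from("=I", page, 0)[0]


def unchanged_since(page: mmap.mmap, generation: int) -> bool:
    # A plain memory read, no syscall on the check path.
    return read_generation(page) == generation


# Demo against a regular file standing in for the device.
with tempfile.NamedTemporaryFile() as stand_in:
    stand_in.write(struct.pack("=I", 7).ljust(mmap.PAGESIZE, b"\0"))
    stand_in.flush()

    page = map_generation(stand_in.name)
    first = read_generation(page)

    # Simulate a generation change by rewriting the counter in the file.
    os.pwrite(stand_in.fileno(), struct.pack("=I", 8), 0)
    second = read_generation(page)
    still_valid = unchanged_since(page, first)
    page.close()
```

A crypto library would call something like `unchanged_since()` right before using any state that must not be reused across a snapshot.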

The event based mechanism is there for the other cases that are not easily converted to such an approach. As a library owner, you always have the choice.

That said, I don't think the "best-effort" mechanism is as bad as you describe it above. If you're thinking of a normal VM image, imagine systemd implemented vmgenid support. It could install a quick BPF program that blocks all network traffic to and from the VM until its genid is fully synchronized across all processes. Ideally, it would even be able to kill uncooperative processes eventually, so that your resumed VM is reachable after a timeout.
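
The "wait for everyone, but only up to a timeout" part of that idea can be sketched in a few lines of Python. This is only a model of the orchestration logic: `AdjustmentBarrier`, `confirm` and the watcher names are hypothetical, and what to do with stragglers (kill them, keep traffic blocked, ...) is a policy decision left to the caller.

```python
import threading
from typing import Iterable, Set


class AdjustmentBarrier:
    """Tracks which watchers still have to confirm the new generation."""

    def __init__(self, watchers: Iterable[str]) -> None:
        self._pending: Set[str] = set(watchers)
        self._cond = threading.Condition()

    def confirm(self, watcher: str) -> None:
        # Called by (or on behalf of) a process once it has adjusted.
        with self._cond:
            self._pending.discard(watcher)
            self._cond.notify_all()

    def wait_adjusted(self, timeout: float) -> Set[str]:
        """Block until all watchers confirmed or `timeout` seconds passed.

        Returns the watchers that have not confirmed (empty set on success).
        """
        with self._cond:
            self._cond.wait_for(lambda: not self._pending, timeout=timeout)
            return set(self._pending)


barrier = AdjustmentBarrier({"sshd", "smbd", "java-app"})
barrier.confirm("sshd")
barrier.confirm("smbd")

# "java-app" never confirms, so the wait times out and reports it.
stragglers = barrier.wait_adjusted(timeout=0.05)
```

An init system following this model would unblock traffic once `wait_adjusted()` returns an empty set, and otherwise deal with the returned stragglers first.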

Like you correctly point out, there are also scenarios like tight
seccomp jails where even the FS interface is inaccessible. For cases
like this and others, I believe we will have to work incrementally to
build up the interface diversity to cater to the full diversity of
user scenarios.

It would be much nicer if we could have one simple interface that lets
everyone correctly do what they need to, though...

I want a pony too :). We need to do what's best for our users here. I am not convinced that only offering a transaction based interface is going to find the adoption we're hoping for. That means we'll end up less secure than we want to be, not more.


Alex


