Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

From: Alexander Graf
Date: Mon May 02 2022 - 14:35:50 EST



On 02.05.22 20:04, Jason A. Donenfeld wrote:
Hey Lennart,

On Mon, May 02, 2022 at 06:51:19PM +0200, Lennart Poettering wrote:
On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@xxxxxxxxx) wrote:

In order to inform userspace of virtual machine forks, this commit adds
a "fork_event" sysctl, which does not return any data, but allows
userspace processes to poll() on it for notification of VM forks.

It avoids exposing the actual vmgenid from the hypervisor to userspace,
in case there is any randomness value in keeping it secret. Rather,
userspace is expected to simply use getrandom() if it wants a fresh
value.
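
A minimal sketch of how userspace might consume such a sysctl, assuming it ends up at /proc/sys/kernel/random/fork_event (the path is a guess, the thread does not name it) and that it reports changes the way other pollable sysctls do, via POLLPRI and POLLERR. The function names are hypothetical.

```c
#include <fcntl.h>
#include <poll.h>
#include <sys/random.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed location of the sysctl; not confirmed by the thread. */
#define FORK_EVENT_PATH "/proc/sys/kernel/random/fork_event"

/* Open the sysctl once and keep the fd, so that later polls are
 * measured against the state at open time. Returns -1 on failure. */
int open_fork_event(const char *path)
{
    return open(path, O_RDONLY);
}

/* Wait up to timeout_ms for a VM fork notification.
 * Returns 1 on an event, 0 on timeout, -1 on error. */
int wait_for_vm_fork(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLPRI, .revents = 0 };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -1;
    /* Assumption: a fork shows up as POLLPRI and/or POLLERR. */
    return (ret > 0 && (pfd.revents & (POLLPRI | POLLERR))) ? 1 : 0;
}

/* After an event, fetch a fresh value from the RNG instead of reading
 * anything from the sysctl itself. Returns 0 on success. */
int refresh_secret(unsigned char *buf, size_t len)
{
    return getrandom(buf, len, 0) == (ssize_t)len ? 0 : -1;
}
```

A caller would open FORK_EVENT_PATH once, loop on wait_for_vm_fork() with a timeout of -1, and call refresh_secret() each time it returns 1.
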
Wouldn't it make sense to expose a monotonic 64-bit counter of detected
VM forks since boot through read()? It might be interesting for
userspace to know how many fork events it missed. Moreover, it might be
interesting for userspace to know whether any fork has happened so far
*at* *all*, by checking whether the counter is non-zero.
"Might be interesting" is different from "definitely useful". I'm not
going to add this without a clear use case. This feature is pretty
narrowly scoped in its objectives right now, and I intend to keep it
that way if possible.
Sure, whatever. I mean, if you think it's preferable to have 3 API
abstractions for the same concept, each for its own special use case,
then that's certainly one way to do things. I personally would try to
figure out a modicum of generalization for things like this. But maybe
that's just me…

I can just tell you that in systemd we'd have a use case for consuming
such a generation counter: we try to provide stable MAC addresses for
synthetic network interfaces managed by networkd, so we hash them from
/etc/machine-id, but otoh people also want them to change when they
clone their VMs. We could very nicely solve this if we had a
generation counter easily accessible from userspace, that starts at 0
initially. Because then we can hash as we always did when the counter
is zero, but otherwise use something else, possibly hashed from the
generation counter.
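
The scheme described above could be sketched roughly as follows. FNV-1a is used here only as a simple stand-in hash, not what systemd actually uses, and derive_mac() and its parameters are hypothetical names. With a generation of 0 the result depends only on the machine ID and interface name, so existing addresses stay stable.

```c
#include <stdint.h>
#include <string.h>

/* FNV-1a step function; a placeholder for a proper keyed hash. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Derive a MAC address from the machine ID and interface name. The
 * generation counter is mixed in only when it is non-zero, so a VM
 * that was never cloned keeps the address it always had. */
void derive_mac(const char *machine_id, const char *ifname,
                uint64_t generation, uint8_t mac[6])
{
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */

    h = fnv1a(h, machine_id, strlen(machine_id));
    h = fnv1a(h, ifname, strlen(ifname));
    if (generation != 0) {
        uint8_t gen[8];

        /* Serialize little-endian so the result is host-independent. */
        for (int i = 0; i < 8; i++)
            gen[i] = (uint8_t)(generation >> (8 * i));
        h = fnv1a(h, gen, sizeof(gen));
    }
    for (int i = 0; i < 6; i++)
        mac[i] = (uint8_t)(h >> (8 * i));
    mac[0] &= 0xfe; /* unicast */
    mac[0] |= 0x02; /* locally administered */
}
```

Note that this sketch inherits the weakness raised in the reply that follows: two clones of the same parent would see the same counter value and therefore derive the same address.
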
This doesn't work, because you could have memory-A split into memory-A.1
and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
the same new value "2". The solution is to instead have the hypervisor
pass a unique value and a counter. We currently have a 16 byte unique
value from the hypervisor, which I'm keeping as a kernel space secret
for the RNG; we're waiting on a word-sized monotonic counter interface
from hypervisors in the future. When we have the latter, then we can
start talking about mmapable things. Your use case would probably be
served by exposing that 16-byte unique value (hashed with some constant
for safety I suppose), but I'm hesitant to start going down that route
all at once, especially if we're to have a more useful counter in the
future.
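
To make the collision concrete, here is a small model of the split described above. The struct and the vm_clone() helper are purely illustrative, and the "fresh" unique value is faked with a fill byte rather than coming from a real hypervisor.

```c
#include <stdint.h>
#include <string.h>

struct vm_state {
    uint64_t counter;   /* guest-side fork counter */
    uint8_t unique[16]; /* per-instance value from the hypervisor */
};

/* Model one clone: the child inherits the parent's memory, bumps the
 * counter, and is handed a fresh unique value (faked via fill_byte). */
struct vm_state vm_clone(const struct vm_state *parent, uint8_t fill_byte)
{
    struct vm_state child = *parent;

    child.counter++;
    memset(child.unique, fill_byte, sizeof(child.unique));
    return child;
}

/* Comparing counters alone cannot tell sibling clones apart. */
int same_by_counter(const struct vm_state *a, const struct vm_state *b)
{
    return a->counter == b->counter;
}

/* Comparing the unique value together with the counter can. */
int same_by_unique_and_counter(const struct vm_state *a,
                               const struct vm_state *b)
{
    return a->counter == b->counter &&
           memcmp(a->unique, b->unique, sizeof(a->unique)) == 0;
}
```

Cloning a parent whose counter is 1 twice yields two children that both carry the counter value 2, which is the "same new value" problem; only the unique value distinguishes them.
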


Michael, since we already changed the CID in the spec, can we add a property to the device that indicates the first 4 bytes of the UUID will always be different between parent and child?

That should give us the ability to mmap the vmgenid directly to user space and act based on a simple u32 compare for clone notification, no?
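
A rough sketch of what that compare could look like in userspace, assuming the proposed property existed and the 16-byte ID were mapped read-only into the process. How the mapping would be obtained is left open here, and both function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Read the first 4 bytes of the 16-byte generation ID as a u32.
 * memcpy avoids alignment and aliasing problems. */
uint32_t vmgenid_prefix(const uint8_t id[16])
{
    uint32_t v;

    memcpy(&v, id, sizeof(v));
    return v;
}

/* Returns 1 if the mapped ID no longer matches the cached prefix.
 * This is only reliable under the proposed guarantee that the first
 * 4 bytes always differ between parent and child. */
int vm_cloned_since(uint32_t cached_prefix, const uint8_t current_id[16])
{
    return vmgenid_prefix(current_id) != cached_prefix;
}
```

A process would cache vmgenid_prefix() at startup and call vm_cloned_since() before using any state that must not be shared across clones.
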


Thanks,

Alex




