Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity

From: Jason Gunthorpe
Date: Wed Sep 29 2021 - 14:26:50 EST


On Wed, Sep 29, 2021 at 12:06:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Sep 2021 13:16:02 -0300
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> > On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote:
> > > On Tue, 28 Sep 2021 16:35:50 -0300
> > > Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > >
> > > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> > > >
> > > > > In defining the device state, we tried to steer away from defining it
> > > > > in terms of the QEMU migration API, but rather as a set of controls
> > > > > that could be used to support that API to leave us some degree of
> > > > > independence that QEMU implementation might evolve.
> > > >
> > > > That is certainly a different perspective, it would have been
> > > > better to not express this idea as a FSM in that case...
> > > >
> > > > So each state in mlx5vf_pci_set_device_state() should call the correct
> > > > combination of (un)freeze, (un)quiesce and so on so each state
> > > > reflects a defined operation of the device?
> > >
> > > I'd expect so, for instance the implementation of entering the _STOP
> > > state presumes a previous state that where the device is apparently
> > > already quiesced. That doesn't support a direct _RUNNING -> _STOP
> > > transition where I argued in the linked threads that those states
> > > should be reachable from any other state. Thanks,
> >
> > If we focus on mlx5 there are two device 'flags' to manage:
> > - Device cannot issue DMAs
> > - Device internal state cannot change (ie cannot receive DMAs)
> >
> > This is necessary to co-ordinate across multiple devices that might be
> > doing peer to peer DMA between them. The whole multi-device complex
> > should be moved to "cannot issue DMA's" then the whole complex would
> > go to "state cannot change" and be serialized.
>
> Are you anticipating p2p from outside the VM? The typical scenario
> here would be that p2p occurs only intra-VM, so all the devices would
> stop issuing DMA (modulo trying to quiesce devices simultaneously).

Inside the VM.

Your 'modulo trying to quiesce devices simultaneously' is correct -
this is a real issue that needs to be solved.

If we put one device in a state where it's internal state is immutable
it can no longer accept DMA messages from the other devices. So there
are two states in the HW model - do not generate DMAs and finally the
immutable internal state where even external DMAs are refused.

> > The expected sequence at the device is thus
> >
> > Resuming
> > full stop -> does not issue DMAs -> full operation
> > Suspend
> > full operation -> does not issue DMAs -> full stop
> >
> > Further the device has two actions
> > - Trigger serializating the device state
> > - Trigger de-serializing the device state
> >
> > So, what is the behavior upon each state:
> >
> > * 000b => Device Stopped, not saving or resuming
> > Does not issue DMAs
> > Internal state cannot change
> >
> > * 001b => Device running, which is the default state
> > Neither flags
> >
> > * 010b => Stop the device & save the device state, stop-and-copy state
> > Does not issue DMAs
> > Internal state cannot change
> >
> > * 011b => Device running and save the device state, pre-copy state
> > Neither flags
> > (future, DMA tracking turned on)
> >
> > * 100b => Device stopped and the device state is resuming
> > Does not issue DMAs
> > Internal state cannot change
>
> cannot change... other that as loaded via migration region.

Yes

> The expected protocol is that if the user write to the device_state
> register returns an errno, the user reevaluates the device_state to
> determine if the desired transition is unavailable (previous state
> value is returned) or generated a fault (error state value
> returned).

Hmm, interesting, mlx5 should be doing this as well. Eg resuming with
corrupt state should fail and cannot be recovered except via reset.

> The 101b state indicates _RUNNING while _RESUMING, which is simply not
> a mode that has been spec'd at this time as it would require some
> mechanism for the device to fault in state on demand.

So lets error on these requests since we don't know what state to put
the device into.

> > The two actions:
> > trigger serializing the device state
> > Done when asked to go to 010b ?
>
> When the _SAVING bit is set. The exact mechanics depends on the size
> and volatility of the device state. A GPU might begin in pre-copy
> (011b) to transmit chunks of framebuffer data, recording hashes of
> blocks read by the user to avoid re-sending them during the
> stop-and-copy (010b) phase.

Here I am talking specifically about mlx5 which does not have a state
capture in pre-copy. So mlx5 should capture state on 010b only, and
the 011b is a NOP.

> > trigger de-serializing the device state
> > Done when transition from 100b -> 000b ?
>
> 100b -> 000b is not a required transition, generally this would be 100b
> -> 001b, ie. end state of _RUNNING vs _STOP.

Sorry, I typo'd it, yes to _RUNNING

> I think the requirement is that de-serialization is complete when the
> _RESUMING bit is cleared. Whether the driver chooses to de-serialize
> piece-wise as each block of data is written to the device or in bulk
> from a buffer is left to the implementation. In either case, the
> driver can fail the transition to !_RESUMING if the state is incomplete
> or otherwise corrupt. It would again be the driver's discretion if
> the device enters the error state or remains in _RESUMING. If the user
> has no valid state with which to exit the _RESUMING phase, a device
> reset should return the device to _RUNNING with a default initial state.

That makes sense enough.

> > There is a missing state "Stop Active Transactions" which would be
> > only "does not issue DMAs". I've seen a proposal to add that.
>
> This would be to get all devices to stop issuing DMA while internal
> state can be modified to avoid the synchronization issue of trying to
> stop devices concurrently?

Yes, as above

> For PCI devices we obviously have the bus master bit to manage that,
> but I could see how a migration extension for such support (perhaps
> even just wired through to BM for PCI) could be useful. Thanks,

I'm nervous to override the BM bit for something like this, the BM bit
isn't a gentle "please coherently stop what you are doing" it is a
hanbrake the OS pulls to ensure any PCI device becomes quiet.

Thanks,
Jason