Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

From: Roger Pau Monné
Date: Fri Feb 21 2020 - 04:47:45 EST


On Fri, Feb 21, 2020 at 12:49:18AM +0000, Anchal Agarwal wrote:
> On Thu, Feb 20, 2020 at 10:01:52AM -0700, Durrant, Paul wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > > Sent: 20 February 2020 16:49
> > > To: Durrant, Paul <pdurrant@xxxxxxxxxxxx>
> > > Cc: Agarwal, Anchal <anchalag@xxxxxxxxxx>; Valentin, Eduardo
> > > <eduval@xxxxxxxxxx>; len.brown@xxxxxxxxx; peterz@xxxxxxxxxxxxx;
> > > benh@xxxxxxxxxxxxxxxxxxx; x86@xxxxxxxxxx; linux-mm@xxxxxxxxx;
> > > pavel@xxxxxx; hpa@xxxxxxxxx; tglx@xxxxxxxxxxxxx; sstabellini@xxxxxxxxxx;
> > > fllinden@xxxxxxxxxx; Kamata, Munehisa <kamatam@xxxxxxxxxx>;
> > > mingo@xxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxxx; Singh, Balbir
> > > <sblbir@xxxxxxxxxx>; axboe@xxxxxxxxx; konrad.wilk@xxxxxxxxxx;
> > > bp@xxxxxxxxx; boris.ostrovsky@xxxxxxxxxx; jgross@xxxxxxxx;
> > > netdev@xxxxxxxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx; rjw@xxxxxxxxxxxxx;
> > > linux-kernel@xxxxxxxxxxxxxxx; vkuznets@xxxxxxxxxx; davem@xxxxxxxxxxxxx;
> > > Woodhouse, David <dwmw@xxxxxxxxxxxx>
> > > Subject: Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks
> > > for PM suspend and hibernation
> > > For example one necessary difference will be that xenbus initiated
> > > suspend won't close the PV connection, in case suspension fails. On PM
> > > suspend you seem to always close the connection beforehand, so you
> > > will always have to re-negotiate on resume even if suspension failed.
> > >
> I don't get what you mean by 'suspension failure'. A failure while disconnecting
> the frontend from the backend? [in this case we mark the frontend closed and then
> wait for completion] Or do you mean suspension failing in general after the backend
> is disconnected from the frontend for blkfront?

I don't think you strictly need to disconnect from the backend when
suspending. Just waiting for all requests to finish should be enough.

This has the benefit of not having to renegotiate if the suspension
fails, and thus you can recover from suspension faster in case of
failure. Since you haven't closed the connection with the backend just
unfreezing the queues should get you working again, and avoids all the
renegotiation.
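
To make the difference concrete, here is a rough userspace toy model of
the two recovery paths. It is not blkfront code: every name in it
(toy_front, toy_suspend_keep, toy_suspend_close, toy_recover) is made up
for illustration, and "draining" is modelled as simply completing all
in-flight requests.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum conn_state { CONN_CONNECTED, CONN_CLOSED };

struct toy_front {
    enum conn_state state;  /* state of the PV connection */
    bool frozen;            /* queue accepts no new requests */
    int inflight;           /* requests currently on the shared ring */
    int renegotiations;     /* number of full handshakes performed */
};

/* Wait for in-flight requests: modelled as completing all of them. */
static void toy_drain(struct toy_front *f)
{
    f->inflight = 0;
}

/* Freeze and drain, but keep the connection to the backend open. */
static void toy_suspend_keep(struct toy_front *f)
{
    f->frozen = true;
    toy_drain(f);
}

/* Freeze, drain, and additionally close the connection. */
static void toy_suspend_close(struct toy_front *f)
{
    f->frozen = true;
    toy_drain(f);
    f->state = CONN_CLOSED;
}

/* Recovery after a failed suspension (or on resume). */
static void toy_recover(struct toy_front *f)
{
    assert(f->inflight == 0);
    if (f->state == CONN_CLOSED) {
        /* Closed connections need the full handshake again. */
        f->renegotiations++;
        f->state = CONN_CONNECTED;
    }
    f->frozen = false;  /* unfreezing is all the "keep" path needs */
}
```

In this model, recovering after toy_suspend_keep() only unfreezes the
queue, while recovering after toy_suspend_close() always pays for one
extra renegotiation, whether or not the suspension succeeded.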

> In the latter case, if anything fails after the dpm_suspend(),
> things need to be thawed or set back up, so it should be OK to always
> re-negotiate just to avoid errors.
>
> > > What I'm mostly worried about is the different approach to ring
> > > draining. Ie: either xenbus is changed to freeze the queues and drain
> > > the shared rings, or PM uses the already existing logic of not
> > > flushing the rings and re-issuing in-flight requests on resume.
> > >
> >
> > Yes, that needs consideration. I don't think the same semantics can be suitable for both. E.g. in a xen-suspend we need to freeze with as little processing as possible to avoid dirtying RAM late in the migration cycle, and we know that in-flight data can wait. But in a transition to S4 we need to make sure that at least all the in-flight blkif requests get completed, since they probably contain bits of the guest's memory image and that's not going to get saved any other way.
> >
> > Paul
> I agree with Paul here. Just so you know, I did try a hacky way to
> re-queue requests in the past and failed miserably.

Well, it works AFAIK for xenbus initiated suspension, so I would be
interested to know why it doesn't work with PM suspension.
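
For reference, the re-issue idea can be sketched as a small userspace
toy: the frontend keeps a private record of what is in flight and, on
resume, moves those entries back to a pending list instead of draining
them before suspend. This is only an illustration under that
assumption, not the actual blkfront code; toy_ring, toy_req,
TOY_RING_SIZE and toy_requeue_inflight are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 8  /* hypothetical ring size */

struct toy_req {
    int id;
    int in_use;  /* non-zero while the request is in flight */
};

struct toy_ring {
    struct toy_req shadow[TOY_RING_SIZE];  /* private copy of in-flight requests */
    int pending[TOY_RING_SIZE];            /* ids waiting to be (re)submitted */
    size_t npending;
};

/*
 * On resume, move every request still marked in-flight back onto the
 * pending list so it gets issued again. Returns the number re-queued.
 */
static size_t toy_requeue_inflight(struct toy_ring *r)
{
    size_t n = 0;

    assert(r != NULL);
    for (size_t i = 0; i < TOY_RING_SIZE && r->npending < TOY_RING_SIZE; i++) {
        if (!r->shadow[i].in_use)
            continue;
        r->pending[r->npending++] = r->shadow[i].id;
        r->shadow[i].in_use = 0;
        n++;
    }
    return n;
}
```

Nothing here waits for completion before suspend, which is exactly the
property that is in question for the PM hibernation case.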

> I doubt [just from my experimentation] that re-queuing the requests will work for
> PM hibernation, for the same reason Paul mentioned above, unless you give me a
> pressing reason why it should work.

My main reason is that I don't want to maintain two different
approaches to suspend/resume without a technical argument for it. I'm
not happy to take a bunch of new code just because the current one
doesn't seem to work in your use-case.

That being said, if there's a justification for doing it differently
it needs to be stated clearly in the commit. From the current commit
message I didn't grasp that there was a reason for not using the
current xenbus suspend/resume logic.

> Also, won't it affect the migration time if we start waiting for all the
> in-flight requests to complete [last-minute page faults]?

Well, it's going to dirty pages that would have to be re-sent to the
destination side.

Roger.