Re: [PATCH v4 00/30] Live Update Orchestrator

From: Jason Gunthorpe

Date: Fri Oct 10 2025 - 11:01:39 EST


On Thu, Oct 09, 2025 at 07:50:12PM -0400, Pasha Tatashin wrote:
> > This can look something like:
> >
> > hugetlb_luo_preserve_folio(folio, ...);
> >
> > Nice and simple.
> >
> > Compare this with the new proposed API:
> >
> > liveupdate_fh_global_state_get(h, &hugetlb_data);
> > // This will have updated the serialized state now.
> > hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
> > liveupdate_fh_global_state_put(h);
> >
> > We do the same thing but in a very complicated way.
> >
> > - When the system-wide preserve happens, the hugetlb subsystem gets a
> > callback to serialize. It converts its runtime global state to
> > serialized state since now it knows no more FDs will be added.
> >
> > With the new API, this doesn't need to be done since each FD prepare
> > already updates serialized state.
> >
> > - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
> > anything in LUO. This is the same as the new API.
> >
> > - If some hugetlb FDs are not restored after liveupdate and the finish
> > event is triggered, the subsystem gets its finish() handler called and
> > it can free things up.
> >
> > I don't get how that would work with the new API.
>
> The new API isn't more complicated; it codifies the common pattern of
> "create on first use, destroy on last use" into a reusable helper,
> saving each file handler from having to reinvent the same reference
> counting and locking scheme. But, as you point out, subsystems provide
> more control, specifically they handle full creation/free instead of
> relying on file-handlers for that.

I'd say hugetlb *should* be doing the more complicated thing. We
should not have global static data for luo floating around the kernel,
this is too easily abused in bad ways.

The above "complicated" sequence forces the caller to have an fd
session handle, and "hides" the global state inside luo so the
subsystem can't just randomly reach into it whenever it likes.

This is a deliberate and violent way to force clean coding practices
and good layering.
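
To make the layering point concrete, here is a minimal userspace sketch of
"create on first use, destroy on last use" where the state is only reachable
through a handle. Every name in it (struct fh_handle, struct global_state,
fh_global_state_get/put) is a hypothetical stand-in, not the API from the
series, and it leaves out the locking the real thing would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from the LUO series. */
struct global_state {
	int refs;	/* outstanding get() calls */
	void *data;	/* subsystem state, created lazily */
};

struct fh_handle {
	struct global_state *gs;	/* owned by the core, not the subsystem */
	size_t data_size;		/* size of the subsystem's state blob */
};

/* Create on first use: the first holder triggers the allocation. */
void *fh_global_state_get(struct fh_handle *h)
{
	struct global_state *gs = h->gs;

	if (gs->refs == 0) {
		gs->data = calloc(1, h->data_size);
		if (!gs->data)
			return NULL;
	}
	gs->refs++;
	return gs->data;
}

/* Destroy on last use: the last holder frees the state. */
void fh_global_state_put(struct fh_handle *h)
{
	struct global_state *gs = h->gs;

	assert(gs->refs > 0);
	if (--gs->refs == 0) {
		free(gs->data);
		gs->data = NULL;
	}
}
```

The subsystem never sees a global pointer; without a handle there is nothing
to call get() on, which is the whole point.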

Not sure why hugetlb pools would need another xarray??

1) Use a vmalloc to store a list of the PFNs in the pool. The pool becomes
frozen: PFNs can't be added or removed.
2) Require the users of hugetlb memory, like memfd, to
preserve/restore the folios they are using (using their hugetlb order)
3) Just before kexec, run over the PFN list and mark a bit for whether
each folio was preserved by KHO or not. Make sure everything gets KHO
preserved.

Restore puts the PFNs that were not preserved directly in the free
pool; the end user of a folio, like the memfd, restores the other
folios and eventually frees them normally.
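
Steps 1 and 3 plus the restore path could look roughly like the userspace
sketch below. It is only an illustration under stated assumptions: the struct
layout and the pool_* function names are made up, calloc() stands in for
vmalloc, and a plain array of preserved PFNs stands in for asking KHO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One entry per hugetlb folio in the pool (hypothetical layout). */
struct pool_entry {
	uint64_t pfn;
	bool preserved;	/* set just before kexec */
};

struct pool_snapshot {
	struct pool_entry *entries;	/* stands in for the vmalloc'd list */
	size_t count;
	bool frozen;			/* no add/remove once set */
};

/* Step 1: record the pool's PFNs and freeze the pool. */
int pool_freeze(struct pool_snapshot *s, const uint64_t *pfns, size_t n)
{
	size_t i;

	s->entries = calloc(n ? n : 1, sizeof(*s->entries));
	if (!s->entries)
		return -1;
	for (i = 0; i < n; i++)
		s->entries[i].pfn = pfns[i];
	s->count = n;
	s->frozen = true;
	return 0;
}

/* Step 3: mark each PFN that some user (e.g. a memfd) preserved. */
void pool_mark_preserved(struct pool_snapshot *s,
			 const uint64_t *kept, size_t nkept)
{
	size_t i, j;

	for (i = 0; i < s->count; i++) {
		s->entries[i].preserved = false;
		for (j = 0; j < nkept; j++)
			if (s->entries[i].pfn == kept[j])
				s->entries[i].preserved = true;
	}
}

/* Restore: unpreserved PFNs go straight back to the free pool. */
size_t pool_restore(const struct pool_snapshot *s,
		    uint64_t *free_pool, size_t cap)
{
	size_t i, n = 0;

	for (i = 0; i < s->count && n < cap; i++)
		if (!s->entries[i].preserved)
			free_pool[n++] = s->entries[i].pfn;
	return n;
}
```

The preserved folios are deliberately not touched by pool_restore(); their
owner restores them and frees them through the normal path later.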

It is simple and fits nicely into the infrastructure here, where the
first time you trigger a global state it does the PFN list and
freezing, and the lifecycle and locking for this operation are directly
managed by luo.

The memfd, when it knows it has hugetlb folios inside it, would
trigger this.

Jason