Re: [PATCH 16/26] cxl/extent: Realize extent devices

From: Ira Weiny
Date: Mon May 06 2024 - 00:36:01 EST


Dan Williams wrote:
> Ira Weiny wrote:
> > Jonathan Cameron wrote:
> > > On Sun, 24 Mar 2024 16:18:19 -0700
> > > ira.weiny@xxxxxxxxx wrote:
> > >
> > > > From: Navneet Singh <navneet.singh@xxxxxxxxx>
> > > >
> > > > Once all extents of an interleave set are present a region must
> > > > surface an extent to the region.
> > > >
> > > > Without interleaving; endpoint decoder and region extents have a 1:1
> > > > relationship. Future support for IW > 1 will maintain a N:1
> > > > relationship between the device extents and region extents.
> > > >
> > > > Create a region extent device for every device extent found. Release of
> > > > the extent device triggers a response to the underlying hardware extent.
> > > >
> > > > There is no strong use case to support the addition of extents which
> > > > overlap previously accepted extent ranges. Reject such new extents
> > > > until such time as a good use case emerges.
> > > >
> > > > Expose the necessary details of region extents by creating the following
> > > > sysfs entries.
> > > >
> > > > /sys/bus/cxl/devices/dax_regionX/extentY
> > > > /sys/bus/cxl/devices/dax_regionX/extentY/offset
> > > > /sys/bus/cxl/devices/dax_regionX/extentY/length
> > > > /sys/bus/cxl/devices/dax_regionX/extentY/label
> > >
> > > Docs?
> >
> > That is a good idea.
> >
> > > The label in particular worries me a little as I'm not sure what
> > > is in it.
> >
> > I envisioned a pass through of the tag.
> >
> > >
> > > If it's the tag one possible format is a uuid (not a coincidence
> > > that it is the same length) and interpreting that as characters isn't
> > > going to get us far. I wonder if we have to treat it as a binary attr
> > > given we have no idea what it is.
> >
> > In thinking about this more (and running some experiments): none of these are
> > strictly necessary in this initial implementation. No code currently uses
> > them directly.
> >
> > I questioned these in the past and I've done so again over the weekend.
> >
> > I was about to rip them out entirely when I remembered Gregory Price's
> > comments on Discord. There he indicating a desire to very carefully place
> > dax devices. Without at least the offset and length above (and to a
> > lesser extent the label) this can't be done.
>
> Careful placement of dax-devices physically requires an entirely new
> allocation ABI. There is the mapping_store() interface that was added
> for a specific kexec / VMM fast restore use case, but that never
> envisioned the sparse region case. So I do think it is worthwhile to
> punt on that question to a later add-on feature.

Agreed.

>
> [..]
> > I don't think this is exactly what the user is going to expect. This can
> > be resolved by by looking at the dax device mappings though.[0] So I'm going
> > to leave this for now. But I expect some additional porcelain is going to
> > be required to fully meet Gregory's requirements.
>
> Not sure what the exact requirement is, but if it's the typical, "I want
> to allocate by tag",

I'm extrapolating that it will be. I want to allocate on the first extent.
Tags were removed from the series a while ago.

> then I think there is another potential coarse
> grained solution that probably covers most cases. Allow multiple
> dax_regions per cxl_dcd_regions, where each dax_region manages an
> exclusive set of tags.

I'll have to think on that because don't dax regions map to specific dpas on
creation?

>
> The host negotiates a dax_region tag layout with the orchestrator and
> can then trust that all of the extents that show up in a given dax_region
> belong to a given tag or set of tags.
>
> This is not something that needs to be considered in the initial
> enabling, but is potentially a way to avoid bolting-on a new fine grained
> allocation api after the fact.
>

The point is I'm not trying to bolt anything on. Just trying to explain what
can and can't be done. The purpose of these entries was to give the user the
ability to see what extents existed and by correlating the dax mappings could
see where their dax mappings landed. Careful allocation of dax devices could
result in the use of some extents and not others. But this was __not__ at all
intended to be done initially. Just use the space as it come available without
any tag use at all.

>
> > [0]
> > /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
> > /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
> >
> > Back to the label field: It is currently just the 'tag' of the individual
> > extent (because no interleaving). My vision for the interleave case would
> > be for the kernel to assemble device extents into a region extent only if
> > the tags match and export that.
> >
> > Thinking on it more though we should leave label out for now. This is the
> > second time it has been questioned.
>
> I don't understand the issue. That is a critical piece of information
> and it is at the cxl device level
>
> /sys/bus/cxl/devices/dax_regionX/extentY/label
>
> ...now I would just call that "tag" and UUID format it (to Jonathan's
> point), but I see no rationale to hide what is most likely the most
> useful information about an extent.

The rationale is that the user was not going to use it. So no use case == no
reason to have it... yet. I had code a while back to allocate dax devices on
specific tags. But that was deemed to different from the current dax
allocation mechanism so it was scrapped... for now.

I can add it back... But I'm just getting a bit testy about who wants what and
how this is all going to get used.

Ira