Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation

From: Jerome Glisse
Date: Tue Dec 04 2018 - 16:16:38 EST


On Tue, Dec 04, 2018 at 01:47:17PM -0700, Logan Gunthorpe wrote:
>
>
> On 2018-12-04 1:14 p.m., Andi Kleen wrote:
> >> Also, in the same vein, I think it's wrong to have the API enumerate all
> >> the different memory available in the system. The API should simply
>
> > We need an enumeration API too, just to display to the user what they
> > have, and possibly for applications to size their buffers
> > (all we do with existing NUMA nodes)
>
> Yes, but I think my main concern is the conflation of the enumeration
> API and the binding API. An application doesn't want to walk through all
> the possible memory and types in the system just to get some memory that
> will work with a couple initiators (which it somehow has to map to
> actual resources, like fds). We also don't want userspace to police
> itself on which memory works with which initiator.

How would an application police itself? The API I am proposing is
best effort, so the kernel can fully ignore a userspace request, as
it already does sometimes with mbind(). The kernel always has the
last word and can always override an application's decision.

A device driver can also decide to override; anything on the kernel
side has more power than userspace does. So while we extend trust to
userspace, we do not abdicate control. That is not the intention
here.


> Enumeration is definitely not the common use case. And if we create a
> new enumeration API now, it may make it difficult or impossible to unify
> these types of memory with the existing NUMA node hierarchies if/when
> this gets more integrated with the mm core.

The point I am trying to make is that this memory cannot be
integrated as regular NUMA nodes inside the mm core; rather, the mm
core can grow to encompass non-NUMA-node memory. I explained why
elsewhere in this thread, but roughly:

- The device driver needs to stay in control of device memory
  allocation, both for backward compatibility and to keep fulfilling
  graphics API constraints (OpenGL, Vulkan, X, ...).

- Adding a new node type is problematic inside mm because we are
  running out of bits in struct page (see the sketch after this
  list).

- Excluding a node from the regular allocation path was rejected
  upstream previously (IBM did post a patchset for that, IIRC).
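
On the struct page point, here is a toy illustration of the bit
budget (made-up field widths; the real layout lives in
include/linux/page-flags-layout.h, which packs section, node and
zone fields next to the per-page flag bits in one unsigned long):

    #include <stdio.h>

    /* Toy field widths, not the kernel's real ones.  Any new
     * classification (say a device-memory node type) would need more
     * bits out of the same 64-bit word that also holds page flags. */
    #define SECTION_BITS  18
    #define NODE_BITS     10
    #define ZONE_BITS      3

    int main(void)
    {
        int used = SECTION_BITS + NODE_BITS + ZONE_BITS;
        printf("field bits used: %d, left for page flags: %d\n",
               used, 64 - used);
        return 0;
    }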

I feel it is a safer path to avoid a one-model-fits-all approach
here and to accept that device memory will be represented and
managed differently from other memory. I believe the persistent
memory folks feel the same on that front.

Nonetheless, I do want to expose this device memory in a standard
way so that we can consolidate and improve the user experience on
that front. Eventually I hope that more of the device memory
management can be turned into common device memory management inside
core mm, but I do not want to enforce that at first, as it is likely
to fail (building a moonbase before you have a moon rocket). I would
rather grow organically from a high-level API that will get used
right away (it is a matter of converting existing users to it:
s/computeAPIBind/HMSBind).
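
To show the kind of conversion I mean, here is a purely hypothetical
sketch; computeAPIBind() and HMSBind() are stub stand-ins, not a
real compute API and not the API from this patchset:

    #include <stddef.h>

    /* Stubs for illustration only. */
    static int computeAPIBind(void *p, size_t n, int device)
    { (void)p; (void)n; (void)device; return 0; }

    static int HMSBind(void *p, size_t n, int target,
                       const int *initiators, int ninitiators)
    { (void)p; (void)n; (void)target;
      (void)initiators; (void)ninitiators; return 0; }

    int main(void)
    {
        char buf[4096];
        int gpu = 0;

        /* Today: placement goes through a device-specific compute
         * API call. */
        computeAPIBind(buf, sizeof(buf), gpu);

        /* Converted: the same intent names a target memory and the
         * initiators that will access it. */
        int initiators[] = { gpu };
        HMSBind(buf, sizeof(buf), /* target */ 1, initiators, 1);
        return 0;
    }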

Cheers,
Jérôme