Re: New subsystem for acceleration devices

From: Oded Gabbay
Date: Mon Aug 08 2022 - 16:26:46 EST


On Mon, Aug 8, 2022 at 8:46 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Sun, Aug 07, 2022 at 09:43:40AM +0300, Oded Gabbay wrote:
>
> > 1. If there is a subsystem which is responsible for creating and
> > exposing the device character files, then there should be some code
> > that connects between each device driver to that subsystem.
> > i.e. There should be functions that each driver should call from its
> > probe and release callback functions.
> >
> > Those functions should take care of the following:
> > - Create metadata for the device, the device's minor(s) and the
> > driver's ioctls table and driver's callback for file operations (both
> > are common for all the driver's devices). Save all that metadata with
> > proper locking.
> > - Create the device char files themselves and supply file operations
> > that will be called per each open/close/mmap/etc.
> > - Keep track of all these objects' lifetime in regard to the device
> > driver's lifetime, with proper handling for release.
> > - Add common handling and entries of sysfs/debugfs for these devices
> > with the ability for each device driver to add their own unique
> > entries.
> >
> > 2. I think that you underestimate (due to your experience) the "using
> > it properly" part... It is not so easy to do this properly for
> > inexperienced kernel people. If we provide all the code I mentioned
> > above, the device driver writer doesn't need to be aware of all these
> > kernel APIs.
>
> This may be, but it still seems weird to me to justify a subsystem as
> "making existing APIs simpler so drivers don't mess them up". It
> suggests perhaps we need additional core API helpers?
>
I'm sorry, but my original argument was poorly phrased. I'll try to
phrase it better.

What I'm trying to say is that imo one of the end goals of doing a
common subsystem is to provide a common uAPI for all the drivers that
belong to that subsystem.
I wrote this argument in a previous email as a criteria whether a
driver should join a subsystem.

So if you want a common uAPI and a common userspace library to use it,
you need to expose the same device character files for every device,
regardless of the driver. e.g. you need all devices to be called
/dev/accelX and not /dev/habanaX or /dev/nvidiaX

This means that the whole device character creation/removal/open/close
is done in the common subsystem, not in each driver individually.
So even if it is a simple code as you said, it still must reside in
the subsystem common code.

Once you put that code there, you need to add meta-data as different
drivers attach to the subsystem and ask to create devices and minors
when their probe function is called. In addition, you need to remove
all this code from each individual driver.

That's what I mean by abstracting all this kernel API from the
drivers. Not because it is an API that is hard to use, but because the
drivers should *not* use it at all.

I think drm did that pretty well. Their code defines objects for
driver, device and minors, with resource manager that will take care
of releasing the objects automatically (it is based on devres.c).

> > > It would be nice to at least identify something that could obviously
> > > be common, like some kind of enumeration and metadata kind of stuff
> > > (think like ethtool, devlink, rdma tool, nvemctl etc)
> > Definitely. I think we can have at least one ioctl that will be common
> > to all drivers from the start.
>
> Generally you don't want that as an ioctl because you have to open the
> device to execute it, there may be permissions issues for instance -
> or if you have a single-open-at-a-time model like VFIO, then it
> doesn't work well together.
>
> Usually this would be sysfs or netlink.
So actually I do want an ioctl but as you said, not for the main
device char, but to an accompanied control device char.

In habana we define two device char per device. One is the compute
which behaves like VFIO, and one is a control device which has no
limitation on the number of applications that can access it. However,
an application only has access to the information ioctl through this
device char (so it can't submit anything, allocate memory, etc.) and
can only retrieve metrics which do not leak information about the
compute application.

The reason I want an ioctl is because it is much more flexible than
sysfs and allows writing proper software to monitor the device in the
data-center. At least, that's my experience from the deployment we had
so far.

>
> > > This makes sense to me, all accelerators need a way to set a memory
> > > map, but on the other hand we are doing some very nifty stuff in this
> > > area with iommufd and it might be very interesting to just have the
> > > accelerator driver link to that API instead of building yet another
> > > copy of pin_user_pages() code.. Especially with PASID likely becoming
> > > part of any accelerator toolkit.
> >
> > Here I disagree with you. First of all, there are many relatively
> > simple accelerators, especially in edge, where PASID is really not
> > relevant.
> > Second, even for the more sophisticated PCIe/CXL-based ones, PASID is
> > not mandatory and I suspect that it won't be in 100% of those devices.
> > But definitely that should be an alternative to the "classic" way of
> > handling dma'able memory (pin_user_pages()).
>
> My point was that iommufd can do the pinning for you and dump that
> result into a iommu based PASID, or it can do the pinning for you and
> allow the driver to translate it into its own page table format eg the
> ASID in the habana device.
>
> We don't need to have map/unmap APIs to manage address spaces in every
> subsystem.
I see, I will need to learn this more in-depth if/when it will become
relevant. But of course I agree that reusing existing code is much
better.

>
> > Maybe this is something that should be discussed in the kernel summit ?
>
> Maybe, I expect to be at LPC at least
Great!
Oded
>
> Jason