Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

From: Thomas Gleixner
Date: Sun Dec 05 2021 - 09:16:57 EST


On Sat, Dec 04 2021 at 15:20, Thomas Gleixner wrote:
> On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> So I need to break that up in a way which caters for both cases, but
> does neither create a special case for PCI nor for the rest of the
> universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
> means all of it is common case. With that solved the storage question
> becomes a nobrainer.
>
> When I looked at it again yesterday while writing mail, I went back to
> my notes and found the loose ends where I left off. Let me go back and
> start over from there.

I found out why I stopped looking at it. I came from a similar point of
view what you were suggesting:

>> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
>> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
>> the driver asks for PCI MSI access we create a new hierarchical
>> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
>> you outlined for IMS? (again, not saying to do this, but let's ask if
>> that makes more sense than the current configuration)

Which I shot down with:

> That's not really a good idea because dev->irqdomain is a generic
> mechanism and not restricted to the PCI use case. Special casing it for
> PCI is just wrong. Special casing it for all use cases just to please
> PCI is equally wrong. There is a world outside of PCI and x86.

That argument is actually only partially correct.

After studying my notes and some more code (sigh), it looks feasible
under certain assumptions, constraints and consequences.

Assumptions:

1) The irqdomain pointer of PCI devices which is set up during device
discovery is not used by anything else than infrastructure which
knows how to handle it.

Of course there is no guarantee, but I'm not that horrified about
it anymore after chasing the exposure with yet more coccinelle
scripts.

Constraints:

1) This is strictly opt-in and depends on hierarchical irqdomains.

If an architecture/subarchitecture wants to support it then it
needs to rework their PCI/MSI backend to hierarchical irqdomains or
make their PCI/MSI irqdomain ready for the task.

From my inspection of 30+ PCI/MSI irqdomains, most of them should
be trivial to convert. The hard ones are powerpc, XEN and VMD,
where XEN is definitely the most convoluted one.

That means that devices which depend on IMS won't work on anything
which is not up to date.

2) Guest support is strictly opt-in

The underlying architecture/subarchitecture specific irqdomain has
to detect at setup time (eventually early boot), whether the
underlying hypervisor supports it.

The only reasonable way to support that is the availability of
interrupt remapping via vIOMMU, as we discussed before.

3) IOMMU/Interrupt remapping dependency

While IMS just works without interrupt remapping on bare metal the
fact that there is no reliable way to figure out whether the kernel
runs on bare metal or not, makes it pretty much mandatory, at least
on x86.

That's not a hardcoded constraint. It's a decision made during the
setup of the underlying architecture/subarchitecture specific
irqdomain.

4) The resulting irqdomain hierarchy would ideally look like this:

VECTOR -> [IOMMU, ROUTING, ...] -> PCI/[MSI/MSI-X/IMS] domains

That does not work in all cases due to architecture and host
controller constraints, so we might end up with:

VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains

The nice thing about the irqdomain hierarchy concept is that this
does not create any runtime special cases as the base hierarchy is
established at boot or device detection time. It's just another
layer of indirection.

5) The design rules for the device specific IMS irqdomains have to be
documented and enforced to the extent possible.

Rules which I have in my notes as of today:

- The device specific IMS irq chip / irqdomain has to be strictly
separated from the rest of the driver code and can only
interact via the irq chip data which is either per interrupt or
per device.

I have some ideas how to enforce these things to go into
drivers/irqchip/ so they are exposed to scrutiny and not
burried in some "my device is special" driver code and applied
by subsystem maintainers before anyone can even look at it.

- The irqchip callbacks which can be implemented by these top
level domains are going to be restricted.

- For the irqchip callbacks which are allowed/required the rules
vs. following down the hierarchy need to be defined and
enforced.

- To achieve that the registration interface will not be based on
struct irq_chip. This will be a new representation and the core
will convert that into a proper irq chip which fits into the
hierarchy. This provides one central place where the hierarchy
requirements can be handled as they depend on the underlying
MSI domain (IOMMU, SHIM, etc.). Otherwise any change on that
would require to chase the IMS irqchips all over the place.

Consequences:

1) A more radical split between legacy and hierarchical irqdomain
code in drivers/pci/msi/ into:

- api
- legacy
- irqdomain
- shared

That means that we are going to end up with duplicated code for
some of the mechanisms up to the point where the stuck-in-the-mud
parts either get converted or deleted.

2) The device centric storage concept will stay as it does not make
any sense to push it towards drivers and what's worse it would be a
major pain vs. the not yet up to the task irqdomains and the legacy
architecture backends to change that. I really have no interrest to
make the legacy code

It also makes sense because the interrupts are strictly tied to the
device. They cannot originate from some disconnected layer of thin
air.

Sorry Jason, no tables for you. :)

How to get there:

1) I'm going to post part 1-3 of the series once more with the fallout
and review comments addressed.

2) If nobody objects, I'll merge that into tip irq/msi and work on top
of that.

The consolidation makes sense on it's own and is required anyway. I
might need to revisit some of the already touched places, but that
should be a halfways limited number. I rather do that step for step
on top than going back to start and mixing the new concepts in from
the very beginning.

But I drop part 4 in it's current form because that's going to be
part of the new infrastructure.

3) I'll work on that bottom up towards a driver exposable API as that
is going to be a result of the final requirements of the underlying
infrastructure.

The final driver visible interfaces can be bikeshedded on top to
make them palatable for driver writers.

4) Convert x86 PCI/MSI[x] to the new scheme

5) Implement an IMS user.

The obvious candidate which should be halfways accessible is the
ath11 PCI driver which falls into that category.

It uses horrendous hackery to make it "work" by abusing MSI. It's a
wonder that it works at all, by some definition of "works".

I'm pretty sure how to make it fall apart without touching a single
line of code.

With a small code change I can make it fail hard without blowing up
any other MSI/MSI-X user except the switchtec NTB.

That's a prime example for the way how driver writers behave.

Instead of talking to the people who are responsible for the
interrupt subsystem, they go off and do their own thing. It does
not explode on their test machine, but it's not even remotely close
to the requirements for PCI drivers to work independent of the
underlying platform.

Of course the responsible maintainer does not even notice and waves
it through without questioning it.

Thoughts?

Thanks,

tglx