Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs

From: Ramon Fried
Date: Mon Jan 20 2020 - 03:02:29 EST


On Sat, Jan 18, 2020 at 12:47 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Ramon,
>
> Ramon Fried <rfried.dev@xxxxxxxxx> writes:
> > On Fri, Jan 17, 2020 at 7:11 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >> The device which incorporates the MSI endpoint.
> >
> > This is not how the MSI specs describe it, so I'm confused.
> > According to spec, MSI is just an ordinary posted PCIe TLP to a certain
> > memory on the root-complex.
>
> That's the message transport itself.
>
> > The only information it has whether to send an MSI or not is the
> > masked/pending register in the config space.
>
> Correct.
>
> > So, basically, back to my original question, without tinkering with
> > these bits, the device will always send the MSI's,
>
> What do you mean with 'always send'?
>
> It will send ONE message every time the device IP block raises an
> interrupt as long as it's not masked in the config space.
>
> > it's just that they will be masked on the MSI controller on the
> > host. right ?
>
> No. If you mask them on the host then you can lose interrupts.
>
> Let's take an example. Network card.
>
>   Incoming packet
>      network controller raises interrupt
>      MSI endpoint sends message
>      Message raises interrupt in CPU
>
>   interrupt is serviced
>      handle_edge_irq()
>        acknowledge interrupt at the CPU level
>        call_driver_interrupt_handler()
>           fiddle_with_device()
>      return from interrupt
>
> So now if you use a threaded handler in that driver (or use force
> threading) then it looks like this:
>
>   Incoming packet
>      network controller raises interrupt
>      MSI endpoint sends message
>      Message raises interrupt in CPU
>
>   interrupt is serviced
>      handle_edge_irq()
>        acknowledge interrupt at the CPU level
>        call_primary_interrupt_handler()
>           wake_irq_thread()
>      return from interrupt
>
>   run_irq_thread()
>      call_driver_interrupt_handler()
>         fiddle_with_device()
>      wait_for_next_irq();
>
> In both cases the network controller can raise another interrupt
> _before_ the initial one has been fully handled and of course the MSI
> endpoint will send a new message which triggers the pending logic in the
> edge handler or in case of a threaded handler kicks the thread to run
> another round.
>
> Now you might think that if there are tons of incoming packets then the
> network controller will raise tons of interrupts before the interrupt
> handler completes. That would be outright stupid. So what the network
> controller (assuming it is sanely designed) does is:
>
>   packet arrives
>     if (!raised_marker) {
>        raise_interrupt;
>        set raised_marker;
>     }
>
> So now the interrupt handler comes around to talk to the device and the
> processing clears the raised_marker at some point. Either by software or
> automatically when the queue is empty.
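
To check my understanding of the raised_marker logic above, here is a
minimal userspace C model of it. This is only a sketch, not kernel or
device code: the struct, its fields, and both function names are made up
for illustration, and it assumes the marker is cleared once the driver
has drained the queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device that raises at most one interrupt
 * until the driver has dealt with the pending work. */
struct netdev_model {
    bool raised_marker;      /* set while an interrupt is outstanding */
    unsigned int queued;     /* packets waiting for the driver */
    unsigned int msgs_sent;  /* MSI messages sent so far */
};

/* Device side: a packet arrives. A new interrupt (and therefore a new
 * MSI message) is raised only if none is outstanding. */
static void packet_arrives(struct netdev_model *d)
{
    assert(d);
    d->queued++;
    if (!d->raised_marker) {
        d->msgs_sent++;
        d->raised_marker = true;
    }
}

/* Driver side: drain the queue and clear the marker (assumed here to
 * happen when the queue is empty). Returns the packets handled. */
static unsigned int driver_handles(struct netdev_model *d)
{
    unsigned int handled;

    assert(d);
    handled = d->queued;
    d->queued = 0;
    d->raised_marker = false;
    return handled;
}
```

With this model, three packets arriving back to back produce one
message, and the next message is only sent for a packet that arrives
after the driver has drained the queue.
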
>
> If you translate that into an electrical diagram:
>
> Packet 1 2 3 4
>
> ________ _____ _
> NetC-Int _| |___| |_| |_____
>
> MSI-EP M M M M = Message
>
> CPU INT | | |
>
> Driver _______ _________
> handler ____| |____| |______
>
> If you look at packet #4 then you notice that the interrupt for this
> packet is raised and the message is sent _before_ the handler finishes.
>
> And that's where we need to look at interrupt masking.
>
> 1) Masking at the MSI endpoint (PCI configspace)
>
> This is slow and depending on the PCI host this might require
> to take global locks, which is even worse if you have multi queue
> devices firing all at the same time.
>
> So, no this is horrible and it's also not required.
>
> 2) Masking at the host interrupt controller
>
> Depending on the implementation of the controller masking can cause
> interrupt loss. In the above case the message for packet #4 could
> be dropped by the controller. And yes, there are interrupt
> controllers out there which have exactly this problem.
>
> That's why the edge handler does not mask the interrupt in the first
> place.
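
A small userspace C model of the loss scenario described above, as I
understand it. This is a sketch under assumptions: the two controller
behaviours (latching a message that arrives while masked versus dropping
it) are modelled with a hypothetical flag, and none of these names come
from real hardware or from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical host-side controller model. */
struct host_ctrl_model {
    bool masked;                /* interrupt currently masked at the host */
    bool latches_while_masked;  /* true: remembers a masked message */
    bool latched;               /* a message arrived while masked */
    unsigned int delivered;     /* interrupts that reached the CPU */
};

/* A message (e.g. the one for packet #4) arrives at the controller. */
static void msg_arrives(struct host_ctrl_model *c)
{
    assert(c);
    if (!c->masked) {
        c->delivered++;
        return;
    }
    if (c->latches_while_masked)
        c->latched = true;
    /* otherwise the message is silently dropped */
}

/* Unmask: a latching controller replays the remembered message. */
static void unmask(struct host_ctrl_model *c)
{
    assert(c);
    c->masked = false;
    if (c->latched) {
        c->latched = false;
        c->delivered++;
    }
}
```

A controller that does not latch never delivers the message that
arrived while it was masked, which is the loss case in point 2.
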
>
> So now you can claim that your MSI host controller does not have that
> problem. Fine, then you could do masking at the host controller level,
> but what does that buy you? Let's look at the picture again:
>
> Packet 1 2 3 4
>
> ________ _____ ____
> NetC-Int _| |___| |_| |__
>
> MSI-EP M M M M = Message
>
> CPU INT | | |
> Driver _________ ________ __
> handler ____M U____M U_M U____ M = Mask, U = Unmask
>
> You unmask just to get the next interrupt so you mask/handle/unmask
> again. That's actually slower because you get the overhead of unmask,
> which raises the next interrupt in the CPU (it's already latched in the
> MSI translator) and then yet another mask/unmask pair. No matter what,
> you'll lose.
>
> And if you take a look at network drivers, then you find quite a few of
> them which do only one thing in their interrupt service routine:
>
> napi_schedule();
>
> That's raising the NAPI softirq and nothing else. They do not even touch
> the device at all and delegate all the processing to softirq
> context. They rely on the sanity of the network controller not to send
> gazillions of interrupts before the pending stuff has been handled.
>
> That's not any different than interrupt threading. It's exactly the same
> except that the handling runs in softirq context and not in a dedicated
> interrupt thread.
>
> So if you observe issues with your PCI device that it sends gazillions
> of interrupts before the pending ones are handled, then you might talk
> to the people who created that beast or you need to do what some of the
> network controllers do:
>
>   hard_interrupt_handler()
>      tell_device_to_shutup();
>      napi_schedule();
>
> and then something in the NAPI handling tells the device that it can
> send interrupts again.
>
> You can do exactly the same thing with interrupt threading. Register a
> primary handler and a threaded handler and let the primary handler do:
>
>   hard_interrupt_handler()
>      tell_device_to_shutup();
>      return IRQ_WAKE_THREAD;
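
The primary/threaded split above can be modelled in plain userspace C
like this. It is a sketch only and not driver code: the enum mirrors the
idea of the kernel's IRQ_WAKE_THREAD return value under a different,
made-up name, and the device struct and all functions are hypothetical.
I am assuming the thread re-enables the device interrupt once it has
processed the pending work.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the handler return values (names are illustrative). */
enum model_irqreturn {
    MODEL_IRQ_NONE,
    MODEL_IRQ_HANDLED,
    MODEL_IRQ_WAKE_THREAD
};

/* Hypothetical device with its own interrupt enable, which is a
 * device register and not the MSI mask bit in config space. */
struct dev_model {
    bool irq_enabled;
    unsigned int pending_work;
};

/* Device side: new work shows up. Returns true if the device would
 * raise an interrupt for it. */
static bool device_event(struct dev_model *d)
{
    assert(d);
    d->pending_work++;
    return d->irq_enabled;
}

/* Primary handler: quiesce the device, then defer to the thread. */
static enum model_irqreturn primary_handler(struct dev_model *d)
{
    assert(d);
    d->irq_enabled = false;   /* tell_device_to_shutup() */
    return MODEL_IRQ_WAKE_THREAD;
}

/* Threaded handler: do the work, then let the device interrupt again. */
static unsigned int thread_handler(struct dev_model *d)
{
    unsigned int done;

    assert(d);
    done = d->pending_work;
    d->pending_work = 0;
    d->irq_enabled = true;
    return done;
}
```

Between the primary handler and the end of the thread handler the
device stays quiet, so work that arrives in that window is picked up by
the thread without any further interrupt.
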
>
> Coming back to your mask/unmask thing. That has another downside which
> is a layering violation and software complexity.
>
> MSI interrupts are edge type by specification:
>
> "MSI and MSI-X are edge-triggered interrupt mechanisms; neither the
> PCI Local Bus Specification nor this specification support
> level-triggered MSI/MSI-X interrupts."
>
> The whole point of edge-triggered interrupts is that they are just a
> momentary notification which means that they can avoid the whole
> mask/unmask dance and other issues. There are some limitations to edge
> type interrupts:
>
> - Cannot be shared, which is a good thing. Shared interrupts are
> a pain in all aspects
>
> - Can be lost if the momentary notification does not reach the
> receiver. For actual electrical edge type interrupts this happens
> when the active state is too short so that the edge detection
> on the receiver side fails to detect it.
>
> For MSI this is usually not a problem. If the message gets lost on
> the bus then you have other worries than the lost interrupt.
>
> But for both electrical and message based the interrupt receiver on
> the host/CPU side can be a problem when masking is in play. There
> are quite a few broken controllers out there which have that issue
> and it's not trivial to get it right especially with message based
> interrupts due to the async nature of the involved parts.
>
> That's one thing, but now let's look at the layering.
>
> Your MSI host side IP is not an interrupt controller. It is a bridge
> which translates incoming MSI messages and multiplexes them to a level
> interrupt on the GIC. It provides a status register which allows you to
> demultiplex the pending interrupts so you don't have to poll all
> registered handlers to figure out which device actually fired an
> interrupt. Additionally it allows masking, but that's an implementation
> detail and you really should just ignore it except for startup/shutdown.
>
> From the kernel's interrupt system POV the MSI host side controller is
> just a bridge between MSI and GIC.
>
> That's clearly reflected in the irq hierarchy:
>
> |-------------|
> | |
> | GIC |
> | |
> |-------------|
>
> |-------------| |----------|
> | | | |
> | MSI bridge |---------| PCI/MSI |
> | | | |
> |-------------| |----------|
>
> The GIC and the MSI bridge are independent components. The fact that the
> MSI bridge has an interrupt output which is connected to the GIC does
> not create a hierarchy. From the GIC point of view the MSI bridge is
> just like any other peripheral which is connected to one of its input
> lines.
>
> But the PCI/MSI domain has a hierarchical parent, the MSI Bridge. The
> reason why this relationship exists is that the PCI/MSI domain needs a
> way to allocate a message/address for interrupt delivery. And that
> information is provided by the MSI bridge domain.
>
> In an interrupt hierarchy the type of the interrupt (edge/level) and the
> required handler is determined by the outermost domain, in this case the
> PCI/MSI domain. This domain mandates edge type and the edge handler.
>
> And that outermost domain is the primary interrupt chip which is
> involved when the core code manages and handles interrupts. So
> mask/unmask happens at the pci_msi interrupt chip which fiddles with the
> MSI config space. The outermost device can call down into the hierarchy
> to let the underlying domain take further action or delegate certain
> actions completely to the underlying domain, but that delegation is
> pretty much restricted. One example for delegation is the irq_ack()
> action. The ack has to hit the underlying domain usually as on the MSI
> endpoint there is no such thing. If the underlying domain does not need
> that then the irq_ack() routine in the underlying domain is just empty
> or not implemented. But you cannot delegate mask/unmask and other
> fundamental actions because they must happen on the MSI endpoint no
> matter what.
>
> You cannot create some artificial level semantics on the PCI/MSI side and
> you cannot artificially connect your demultiplexing handler to the
> threaded handler of the PCI interrupt without violating all basic rules
> of engineering and common sense at once.
>
> Let me show you the picture from above expanded with your situation:
>
> Packet 1 2 3 4
>
> ________ _____ _
> NetC-Int _| |___| |_| |_____
>
> MSI-EP M M M M = Message
>
> _ _ _
> Bridge __| |__________| |____| |_______
>
> _ _ _
> GIC input __| |__________| |____| |_______
>
> CPU INT | | |
>
> Demux _ _ _
> handler __A |__________A |____A |_______ A == Acknowledge in the bridge
>
> Thread _______ _________
> handler ____| |____| |______
>
> Hope that helps and clarifies it.
>
> Thanks,
>
> tglx
Wow Thomas, this is an amazing answer. I need to go over it a few times
to make sure I understand everything.
I wish there was a way to pin this somewhere so it won't get lost
in the mailing list archive. I think it's a very
nice explanation that should have its own wiki page or something.

Thanks,
Ramon.