RE: [tip:irq/core] genirq/matrix: Improve target CPU selection for managed interrupts.

From: Thomas Gleixner
Date: Wed Nov 07 2018 - 15:23:27 EST


Michael,

On Wed, 7 Nov 2018, Michael Kelley wrote:
> > 2) Managed interrupts:
> >
> > Managed interrupts guarantee vector reservation when the MSI/MSI-X
> > functionality of a device is enabled, which is achieved by reserving
> > vectors in the bitmaps of the possible target CPUs. This reservation
> > decrements the available count on each possible target CPU.
> >
>
> For the curious, could you elaborate on the reservation guarantee for
> managed interrupts? What exactly is guaranteed? I'm trying to
> understand the benefit of reserving a vector on all possible target CPUs.
> I can imagine this may be related to hot-remove of CPUs, but I'm not
> seeing the scenario where reserving on all possible target CPUs solves
> any fundamental problem. irq_build_affinity_masks() spreads the
> target CPUs across the IRQs in the batch, so you might get a small handful
> of possible target CPUs for each IRQ. But if that small handful of CPUs
> were to be hot-removed, then all the reserved vectors disappear anyway.
> So maybe there's another scenario I'm missing.

When managed interrupts are allocated (MSI[-X] enable) then each allocated
Linux interrupt (virtual irq number) is given an affinity mask in the
spreading algorithm. The mask contains 1 or more CPUs depending on the
ratio of queues and possible CPUs.

When the virtual irq and the corresponding data structures are allocated,
then a vector is reserved on each CPU in the affinity mask.
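
As a rough illustration of the two steps above, here is a toy model in
Python. It is a sketch only: spread(), reserve_managed() and the tiny
VECTORS_PER_CPU value are made up for this example, and the real spreading
code also takes NUMA nodes into account, which is ignored here.

```python
# Toy model only: all names and numbers are illustrative, not kernel APIs.
VECTORS_PER_CPU = 4  # hypothetical and tiny on purpose


def spread(num_irqs, possible_cpus):
    """Give each interrupt a contiguous, non-overlapping slice of CPUs.

    Simplified stand-in for the spreading algorithm (no NUMA awareness).
    """
    if not 1 <= num_irqs <= len(possible_cpus):
        raise ValueError("sketch expects 1 <= num_irqs <= number of CPUs")
    base, extra = divmod(len(possible_cpus), num_irqs)
    masks, start = [], 0
    for i in range(num_irqs):
        width = base + (1 if i < extra else 0)
        masks.append(set(possible_cpus[start:start + width]))
        start += width
    return masks


def reserve_managed(available, mask):
    """Reserve one vector on every CPU in the mask, or fail untouched."""
    if any(available[cpu] == 0 for cpu in mask):
        raise RuntimeError("no free vector on some CPU in the mask")
    for cpu in mask:
        available[cpu] -= 1  # the available count drops on each CPU


possible = list(range(8))
available = {cpu: VECTORS_PER_CPU for cpu in possible}
masks = spread(3, possible)
for mask in masks:
    reserve_managed(available, mask)
```

With 3 interrupts and 8 possible CPUs every mask holds 2 or 3 CPUs, and
each CPU gives up exactly one vector because the masks do not overlap.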

The device driver and other layers like block-mq rely on the associated
affinity mask of each interrupt, i.e. they associate a device queue to the
exact same affinity mask. All I/O on the CPUs in the mask goes through that
associated device queue.

So if the allocation were not guaranteed and were allowed to fail, then the
I/O association would not work as expected.

Sure, we could move the interrupt to a random CPU, but that would cause
performance problems especially when the interrupt affinity moves to a
different node.

Now you might argue that reserving one vector on one CPU in the mask would
be sufficient. That's true, if CPU hotplug is disabled and all CPUs are
online when the device driver is initialized.

But it would break assumptions in the CPU hotplug case. The guaranteed
reservation on all CPUs in the associated CPU mask guarantees that the
interrupt can be moved from the outgoing CPU to a still online CPU in the
mask without violating the affinity association.
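
A minimal sketch of why the per-CPU reservation matters at hot-unplug
time, again as a toy Python model. The free/reserved bookkeeping, the irq
names and the "lowest-numbered CPU first" choice are assumptions made for
the example and do not reflect the kernel's actual data structures or
target selection policy.

```python
# Toy bookkeeping, names hypothetical. Both CPUs have run out of
# unreserved vectors, but "irq7" holds a reservation on each of them.
free = {0: 0, 1: 0}
reserved = {0: {"irq7"}, 1: {"irq7"}}


def can_place(irq, cpu):
    """A CPU can take the interrupt if it holds a reservation for it,
    or if it still has an unreserved vector left."""
    return irq in reserved[cpu] or free[cpu] > 0


def migrate(irq, mask, outgoing, online):
    """Pick a still-online CPU in the mask, other than the outgoing one."""
    for cpu in sorted((mask & online) - {outgoing}):
        if can_place(irq, cpu):
            return cpu
    return None  # nothing left in the mask, or no vector to be had


# irq7 was reserved on both CPUs, so moving it off CPU 0 must succeed.
managed_target = migrate("irq7", {0, 1}, outgoing=0, online={0, 1})
# irq9 has no reservation and CPU 1 has no free vector: the move fails.
unreserved_target = migrate("irq9", {0, 1}, outgoing=0, online={0, 1})
```

The interrupt without a reservation has nowhere to go inside its mask,
which is exactly the situation the guaranteed reservation rules out.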

There is another interesting property of managed interrupts vs. CPU
hotplug. When the last CPU in the affinity mask goes offline, then the core
code shuts down the interrupt and the device driver and related layers
exclude the associated device queue from I/O. The same applies for CPUs
which are not online when the device is initialized, i.e. if none of the
CPUs is online then the interrupt is not started and the I/O queue stays
disabled.

When the first CPU in the mask comes online (again), then the interrupt is
reenabled and the device driver and related layers reenable I/O on the
associated device queue.
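
The shutdown/restart behaviour can be pictured as a small state machine.
The following is a hedged sketch in Python; the ManagedIrq class and its
methods are invented for illustration, and picking the lowest-numbered
usable CPU is an assumption, not what the kernel does.

```python
class ManagedIrq:
    """Toy model of one managed interrupt and its device queue."""

    def __init__(self, mask, online):
        self.mask = frozenset(mask)  # fixed at allocation, never changes
        self.target = None           # CPU currently handling the interrupt
        self.started = False         # mirrors "queue takes part in I/O"
        self.on_cpu_change(online)

    def on_cpu_change(self, online):
        """Re-evaluate the interrupt after a CPU went online or offline."""
        usable = sorted(self.mask & set(online))
        if not usable:
            # Last CPU in the mask is gone (or none was ever online):
            # shut the interrupt down, the queue is excluded from I/O.
            self.target, self.started = None, False
        elif self.target not in usable:
            # First CPU came online, or the current target went away:
            # (re)start on a CPU inside the mask.
            self.target, self.started = usable[0], True


irq = ManagedIrq(mask={2, 3}, online={0, 1})  # stays disabled
irq.on_cpu_change({0, 1, 3})                  # CPU 3 online: started
```

Note that the mask itself is never touched; only the target and the
started state follow the hotplug events.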

If the reservation were not guaranteed even across offline/online
cycles, then again the assumptions of the drivers and the related layers
would no longer hold.

Note that the affinity of managed interrupts cannot be changed from
userspace via /proc/irq/$N/affinity for the same reasons.

That was a design decision to simplify the block multi-queue logic in the
device drivers and the related layers. It removed the whole requirement to
track affinity changes, reallocate data structures and reroute I/O. Some of
the early multi-queue device drivers implemented horrible hacks to handle
all those horrors.

Hope that answers your question.

Thanks,

tglx