Re: [RFC/PATCHv2] x86/irq: round-robin distribution of irqs to cpus w/in node

From: Eric W. Biederman
Date: Tue Sep 28 2010 - 06:59:44 EST


Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:

> On Mon, 27 Sep 2010, Eric W. Biederman wrote:
>> > On Mon, 27 Sep 2010, Arthur Kepner wrote:
>> The deep bug is that create_irq_nr allocates a vector (which it does
>> because at the time there was no better way to mark an irq in use on
>> x86). In the case of msi-x we really don't know the node that irq is
>> going to be used on until we get a request irq. We simply know which
>> node the device is on.
>
> Bah. So the whole per node allocation business is completely useless
> at this point.

Probably.

>> If you want to see what is going on, the call trace looks like:
>>   pci_enable_msix
>>     arch_setup_msi_irqs
>>       create_irq_nr
>>
>> After pci_enable_msix is finished then the driver goes and makes all
>> of the irqs per cpu irqs.
>>
>> There are goofy things that happen when hardware asks for 1 irq per cpu.
>> But since msi can ask for up to 4096 irqs (assuming the hardware
>> supports it) I can totally see putting all 256 of those irqs on a single
>> cpu, before you go to user space and let user space or something
>> reassign all of those irqs in a per cpu way.
>>
>> My gut feel says that the real answer is to delay assigning a vector
>> to an irq until request_irq(). At which point we will know that someone
>> at least wants to use the irq.
>
> Right. So the solution would be:
>
> create_irq allocates an irq number + irq descriptor, nothing else
>
> chip->startup() will set up the vector and chip->shutdown() releases
> it. That requires changing the return value of chip->startup to int,
> so we can return an error code, but that can be done in the course of
> the overhaul I'm working on.
>
> Right now I prefer not to add more crap to io_apic.c, it's horrible
> enough already. I'll fix that with the cleanup.

Understood. It has taken a couple of years before this bug finally
bit anyone, so waiting a release or two to get it fixed properly seems
reasonable.
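
To make the split concrete, here is a minimal user-space sketch of the
idea, not kernel code. Every toy_* name, the two-CPU layout and the tiny
per-cpu vector pool are made up for illustration. The only point is that
creating an irq consumes no vector, and that startup can fail with an
error code when the target cpu has no free vector.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical sizes, deliberately tiny so exhaustion is easy to see. */
#define TOY_NR_CPUS      2
#define TOY_VECS_PER_CPU 4
#define TOY_NR_IRQS      16

struct toy_irq_desc {
	int in_use;	/* irq number has been handed out */
	int cpu;	/* cpu owning the vector, -1 if none */
	int vector;	/* vector slot on that cpu, -1 if none */
};

static struct toy_irq_desc toy_descs[TOY_NR_IRQS];
static int toy_vector_used[TOY_NR_CPUS][TOY_VECS_PER_CPU];

/* Allocate an irq number and descriptor only; no vector is consumed. */
int toy_create_irq(void)
{
	for (int irq = 0; irq < TOY_NR_IRQS; irq++) {
		if (!toy_descs[irq].in_use) {
			toy_descs[irq].in_use = 1;
			toy_descs[irq].cpu = -1;
			toy_descs[irq].vector = -1;
			return irq;
		}
	}
	return -ENOSPC;
}

/* Claim a vector on the given cpu. Returns 0 on success or -errno. */
int toy_irq_startup(int irq, int cpu)
{
	struct toy_irq_desc *d;

	if (irq < 0 || irq >= TOY_NR_IRQS || cpu < 0 || cpu >= TOY_NR_CPUS)
		return -EINVAL;
	d = &toy_descs[irq];
	if (!d->in_use)
		return -EINVAL;
	if (d->vector >= 0)
		return -EBUSY;	/* already started */

	for (int v = 0; v < TOY_VECS_PER_CPU; v++) {
		if (!toy_vector_used[cpu][v]) {
			toy_vector_used[cpu][v] = 1;
			d->cpu = cpu;
			d->vector = v;
			return 0;
		}
	}
	return -ENOSPC;	/* this cpu has no free vector */
}

/* Release the vector; the irq number and descriptor stay allocated. */
void toy_irq_shutdown(int irq)
{
	struct toy_irq_desc *d;

	if (irq < 0 || irq >= TOY_NR_IRQS)
		return;
	d = &toy_descs[irq];
	if (!d->in_use || d->vector < 0)
		return;
	assert(toy_vector_used[d->cpu][d->vector]);
	toy_vector_used[d->cpu][d->vector] = 0;
	d->cpu = -1;
	d->vector = -1;
}
```

With this shape a driver can create far more irqs than one cpu has
vectors, and only the ones that are actually started compete for
vectors on the cpu they are started on.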

pci_enable_msix, all in its own way, is fixable, but it has
few enough callers (< 80) that it is also fixable.

drivers/pci/msi.c and drivers/pci/htirq.c are interesting in
that they are arch-independent users of the genirq layer, which
is why msi_desc needed a new field.

Eric

