Re: [RFC/PATCHv2] x86/irq: round-robin distribution of irqs to cpus w/in node

From: Eric W. Biederman
Date: Mon Sep 27 2010 - 20:17:19 EST


Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:

> On Mon, 27 Sep 2010, Arthur Kepner wrote:
>
>> On Mon, Sep 27, 2010 at 10:46:02PM +0200, Thomas Gleixner wrote:
>> > ...
>> > Sigh. Why is this a x86 specific problem ?
>> >
>>
>> It's obviously not. But we're particularly seeing it on x86
>> systems, so an x86-specific fix would address our problem.
>
> Even more sigh.

The fact that x86 has vectors probably doesn't help.

>> > If we setup an irq on a node then we should set the affinity to the
>> > target node in general.
>>
>> OK.
>>
>> > .... The round robin inside the node is really not
>> > a problem unless you hit:
>> >
>> > nr_irqs_per_node * nr_cpus_per_node > max_vectors_per_cpu
>> >
>>
>> No, I don't think that's true.
>>
>> The problem we're seeing is that one driver asks for a large
>> number of interrupts (on no CPU in particular). And because of the
>
> It does it for a node, dammit. Otherwise your patch would be
> absolutely useless.

We derive a node from where the device is plugged in. The driver
does not specify a node.

>> > > + if ((node != -1) && alloc_cpumask_var(&tmp_mask, GFP_ATOMIC)) {
>
>> way that the vectors are initially assigned to CPUs (in
>> __assign_irq_vector()), a particular CPU can have all its vectors
>> consumed.
>
> Stop selling me crap already.

The deep bug is that create_irq_nr allocates a vector (which it does
because at the time there was no better way to mark an irq in use on
x86). In the case of msi-x we really don't know the node that the irq is
going to be used on until we get a request_irq(). We simply know which
node the device is on.

If you want to see what is going on, the call trace looks like:

  pci_enable_msix
    arch_setup_msi_irqs
      create_irq_nr

After pci_enable_msix is finished, the driver goes and makes all
of the irqs per-cpu irqs.

There are goofy things that happen when hardware asks for 1 irq per cpu.
But since msi can ask for up to 4096 irqs (assuming the hardware
supports it) I can totally see putting all 256 of those irqs on a single
cpu, before you go to user space and let user space or something
reassign all of those irqs in a per cpu way.
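
A toy model of that pile-up, for illustration only. This is not the
kernel's __assign_irq_vector(); every name below (toy_node,
toy_assign_first_fit, toy_assign_round_robin) and both constants are
made up, and the vector count is kept tiny so the effect is easy to see.
It assumes a first-fit search over the CPUs of the node and compares it
with a round-robin search:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS         4   /* hypothetical: CPUs in the device's node */
#define TOY_VECTORS_PER_CPU 8   /* hypothetical: far smaller than real hardware */

struct toy_node {
    int used[TOY_NR_CPUS];      /* vectors consumed on each CPU */
};

/*
 * First fit: take a vector on the lowest-numbered CPU that still has one.
 * Returns the CPU chosen, or -1 if every CPU in the node is full.
 */
static int toy_assign_first_fit(struct toy_node *n)
{
    assert(n != NULL);
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (n->used[cpu] < TOY_VECTORS_PER_CPU) {
            n->used[cpu]++;
            return cpu;
        }
    }
    return -1;
}

/*
 * Round robin: start the search at *next, and on success advance *next
 * to the CPU after the one chosen. Returns the CPU chosen, or -1 if full.
 */
static int toy_assign_round_robin(struct toy_node *n, int *next)
{
    assert(n != NULL && next != NULL);
    for (int i = 0; i < TOY_NR_CPUS; i++) {
        int cpu = (*next + i) % TOY_NR_CPUS;
        if (n->used[cpu] < TOY_VECTORS_PER_CPU) {
            n->used[cpu]++;
            *next = (cpu + 1) % TOY_NR_CPUS;
            return cpu;
        }
    }
    return -1;
}
```

With first fit, the first TOY_VECTORS_PER_CPU requests all land on cpu 0
and fill it before cpu 1 is touched. With round robin, the same requests
are spread evenly over the node.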

My gut feel says that the real answer is to delay assigning a vector
to an irq until request_irq(), at which point we will know that someone
at least wants to use the irq.
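
Roughly what that could look like, as a toy sketch rather than a patch.
Nothing here is real kernel code: toy_create_irq and toy_request_irq are
hypothetical stand-ins for create_irq_nr and request_irq(), and the
explicit cpu argument is an assumption that stands for "whatever target
is known by the time the irq is requested" (the real request_irq() takes
no such argument).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS         4     /* hypothetical */
#define TOY_VECTORS_PER_CPU 8     /* hypothetical */
#define TOY_NO_VECTOR       (-1)

struct toy_cpus {
    int used[TOY_NR_CPUS];        /* vectors consumed on each CPU */
};

struct toy_irq {
    int allocated;                /* irq number is reserved */
    int cpu;                      /* CPU holding its vector, or TOY_NO_VECTOR */
};

/* Reserve the irq number only; no vector is consumed yet. */
static void toy_create_irq(struct toy_irq *irq)
{
    assert(irq != NULL);
    irq->allocated = 1;
    irq->cpu = TOY_NO_VECTOR;
}

/*
 * Assign a vector on the requested CPU the first time the irq is requested.
 * Returns 0 on success (or if the irq already has a vector), and -1 if the
 * irq was never created, the CPU is out of range, or that CPU is full.
 */
static int toy_request_irq(struct toy_cpus *cpus, struct toy_irq *irq, int cpu)
{
    assert(cpus != NULL && irq != NULL);
    if (!irq->allocated || cpu < 0 || cpu >= TOY_NR_CPUS)
        return -1;
    if (irq->cpu != TOY_NO_VECTOR)
        return 0;                 /* vector already assigned earlier */
    if (cpus->used[cpu] >= TOY_VECTORS_PER_CPU)
        return -1;
    cpus->used[cpu]++;
    irq->cpu = cpu;
    return 0;
}
```

In this model, creating any number of irqs costs no vectors at all; a
vector is only consumed when something actually requests the irq, and it
is taken on the CPU that was asked for.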

Eric
