RE: 2.6.19-rc1 genirq causes either boot hang or "do_IRQ: cannot handle IRQ -1"
From: Protasevich, Natalie
Date: Mon Oct 09 2006 - 11:28:45 EST
> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xxxxxxxxxxxx]
> Sent: Monday, October 09, 2006 8:46 AM
> To: Arjan van de Ven
> Cc: Linus Torvalds; Muli Ben-Yehuda; Ingo Molnar; Thomas
> Gleixner; Benjamin Herrenschmidt; Rajesh Shah; Andi Kleen;
> Protasevich, Natalie; Luck, Tony; Andrew Morton;
> Linux-Kernel; Badari Pulavarty; Roland Dreier
> Subject: Re: 2.6.19-rc1 genirq causes either boot hang or
> "do_IRQ: cannot handle IRQ -1"
>
> Arjan van de Ven <arjan@xxxxxxxxxxxxx> writes:
>
> >> > So yes, having software say "We want to steer this particular
> >> > interrupt to this L3 cache domain" sounds eminently sane.
> >> >
> >> > Having software specify which L1 cache domain it wants to pollute
> >> > is likely just crazy micro-management.
> >>
> >> The current interrupt delivery abstraction in the kernel is a set
> >> of cpus an interrupt can be delivered to, which seems sufficient
> >> for aiming at a cache domain. Frequently the lower levels of
> >> interrupt delivery map this to a single cpu because of hardware
> >> limitations, but in certain cases we can honor a multiple cpu
> >> request.
> >>
> >> I believe the scheduler has knowledge about different locality
> >> domains for NUMA and everything else. So what is wanting on our
> >> side is some architecture work to do the broad steering by default.
> >
> >
> > Well, normally this is the job of the userspace IRQ balancer to get
> > right; the thing is undergoing a redesign right now to be smarter
> > and deal better with dual/quad core, NUMA, etc., but as a principle
> > thing this is best done in userspace (simply because there's higher
> > level information there, like "is this interrupt for a network
> > device", so that policy can take that into account)
>
> So far all of that higher level information I have seen lives in the
> kernel, which has to export it to user space before the user space
> daemon can do anything with it.
>
> The only time I have seen user space control be useful is when there
> isn't a proper default policy, so that at least you can distribute
> things between the cache domains properly. So far I have been more
> tempted to turn it off, as it will routinely change which NUMA node
> an irq is pointing at, which is not at all ideal for performance, and
> I haven't seen a way (short of replacing it) to tell the user space
> irq balancer that it got its default policy wrong.
>
> It is quite possible I don't have the whole story, but so far
> it just feels like we are making a questionable decision by
> pushing things out to user space. If nothing else it seems
> to make changes more difficult in the irq handling
> infrastructure because we have to maintain stable interfaces.
> Things like per-cpu counters for every irq start to look really
> questionable when you scale the system up.
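For anyone following along, the per-irq cpumask Eric describes shows
up in user space as /proc/irq/<n>/smp_affinity, and that file is
essentially the only knob the balancer daemon gets to turn. A minimal
sketch of setting it from a user space program (the irq number and
the two-cpu mask below are just examples):

#include <stdio.h>

int main(void)
{
	/* irq 16 and mask 0x3 (CPUs 0 and 1) are illustrative only */
	FILE *f = fopen("/proc/irq/16/smp_affinity", "w");

	if (!f) {
		perror("smp_affinity");
		return 1;
	}
	fprintf(f, "%x\n", 0x3);
	fclose(f);
	return 0;
}

Whether a multi-cpu mask like that is actually honored depends on the
interrupt controller, as Eric says above.
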
I'd also like to question the current policy of the user space
irqbalanced. It seems to just go round-robin without much in the way
of heuristics. We are seeing loss of timer interrupts on our systems -
the more processors, the more noticeable it is, but it starts even on
8x partitions; on a 48x system I see about 50% loss, on both ia32 and
x86_64 (haven't checked ia64 yet). With, say, 16 threads it is
unsettling to see 70% overall idle time while only 40-50% of the
timer interrupts get through. System time is not affected, so the
problem is on the back burner for now :) It's not clear yet whether
this is a software or hardware fault, or how much damage it does
(performance-wise etc.). I will provide more information as it
becomes available; I just wanted to raise a red flag, in case others
are seeing the same phenomenon.
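In case others want to look for the same thing, one rough check is to
compare the per-cpu "LOC:" counts in /proc/interrupts against
HZ * uptime * number of cpus. A quick sketch - the HZ value and the
"LOC:" line layout are assumptions about the particular kernel, so
adjust accordingly:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ASSUMED_HZ 250	/* assumption: set to the kernel's real HZ */

int main(void)
{
	char line[4096];
	char *p;
	double uptime = 0.0;
	unsigned long long loc = 0, v;
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	FILE *f;

	f = fopen("/proc/uptime", "r");
	if (f) {
		fscanf(f, "%lf", &uptime);
		fclose(f);
	}

	f = fopen("/proc/interrupts", "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "LOC:", 4) != 0)
			continue;
		/* sum the space-separated per-cpu counts on the LOC: line */
		for (p = line + 4; sscanf(p, "%llu", &v) == 1; ) {
			loc += v;
			while (*p == ' ')
				p++;
			while (*p && *p != ' ')
				p++;
		}
		break;
	}
	fclose(f);

	printf("LOC total : %llu\n", loc);
	printf("expected  : %.0f  (HZ=%d * uptime * %ld cpus)\n",
	       uptime * ASSUMED_HZ * ncpus, ASSUMED_HZ, ncpus);
	return 0;
}
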
Thanks,
--Natalie