Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining-triggered "active" device IRQ interruption

From: Gary Hade
Date: Thu Jun 04 2009 - 16:05:07 EST


On Wed, Jun 03, 2009 at 02:13:23PM -0700, Eric W. Biederman wrote:
> Gary Hade <garyhade@xxxxxxxxxx> writes:
>
> > Correct, after the fix was applied my testing did _not_ show
> > the lockups that you are referring to. I wonder if there is a
> > chance that the root cause of those old failures and the root
> > cause of issue that my fix addresses are the same?
> >
> > Can you provide the test case that demonstrated the old failure
> > cases so I can try it on our systems? Also, do you recall what
> > mainline version demonstrated the old failure
>
> The irq migration had already been moved to interrupt context by the
> time I started working on it, and I managed to verify that there were
> indeed problems with moving it out of interrupt context before my code
> merged.
>
> So if you want to reproduce it, reduce your irq migration to the
> essentials: set IRQ_MOVE_PCNTXT, and always migrate the irqs from
> process context immediately.
>
> Then rapidly migrate an irq that fires at a high rate from one cpu
> to another.
>
> Right now you are insulated from most of the failures because you still
> don't have IRQ_MOVE_PCNTXT. So you are only really testing your new code
> in the cpu hotunplug path.

OK, I'm confused.

It sounds like you want me to force IRQ_MOVE_PCNTXT so that I can
test in a configuration that you say is already broken. Why in the
heck would this config, where you expect lockups even without the
fix, be a productive environment in which to test the fix?
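
Anyway, just to be sure we are talking about the same test: I assume
the rapid-migration piece you describe would be driven from user
level with something like the sketch below (IRQ 30 and CPUs 0/1 are
placeholders picked for illustration, and IRQ_MOVE_PCNTXT would still
have to be forced on the kernel side):

/*
 * Rough sketch of the user-level half of the test: bounce the
 * affinity of a high-rate IRQ between two CPUs as fast as possible.
 * The IRQ number and CPU masks are placeholders; adjust to taste.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *path = "/proc/irq/30/smp_affinity";
	const char *masks[] = { "1", "2" };	/* CPU0, then CPU1 */
	unsigned long i;

	for (i = 0; ; i++) {
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return EXIT_FAILURE;
		}
		fputs(masks[i & 1], f);	/* alternate the target CPU */
		fclose(f);
	}
	return 0;
}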

>
> Now that I look at it in more detail, you are doing a double
> mask_IO_APIC_irq and unmask_IO_APIC_irq on the fast path

Good catch. I had based my changes on the previous
CONFIG_INTR_REMAP IRQ migration delay code, which was doing the
same thing, and I had assumed that the CONFIG_INTR_REMAP code was
correct and that nesting
mask_IO_APIC_irq_desc/unmask_IO_APIC_irq_desc sequences was OK.
Now that I look at how mask_IO_APIC_irq_desc and
unmask_IO_APIC_irq_desc are actually implemented, I see that there
is no reference counting to assure that only the last caller of
unmask_IO_APIC_irq_desc wins, so an inner unmask re-enables the IRQ
immediately and reduces the range of code that is executed with the
IRQ masked.
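
To spell out what I mean, here is a toy model of the masking
behavior (just an illustration, not the actual io_apic.c code):

/*
 * Toy model: mask/unmask only set or clear the mask bit, with no
 * nesting count, so the first unmask re-enables the IRQ even though
 * an outer mask is still logically "held".
 */
#include <stdbool.h>
#include <stdio.h>

static bool masked;			/* stands in for the RTE mask bit */

static void mask_io_apic_irq(void)   { masked = true; }
static void unmask_io_apic_irq(void) { masked = false; }

int main(void)
{
	mask_io_apic_irq();		/* outer caller masks */
	mask_io_apic_irq();		/* nested (duplicate) mask */
	unmask_io_apic_irq();		/* nested unmask ... */
	printf("masked = %d\n", masked); /* ... prints 0: IRQ live again */
	unmask_io_apic_irq();		/* outer unmask is now a no-op */
	return 0;
}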

Do you see any reason why the IRQ needs to be masked
downstream of the unmask_IO_APIC_irq_desc call that the
patch adds?

> and duplicating the pending irq check.

Yes, I was aware of this and had considered removing the current
check, but I was trying to minimize the changes as much as
possible, especially those affecting the normal IRQ migration path.

Removing that check would simplify the code and would also
eliminate the failed affinity change requests from user level that
the current check can cause.

Do you think it would be a good idea to remove this
extra check? If so, I would prefer to do it with a
separate patch so that I can do a good job of testing it.

Thanks,
Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
garyhade@xxxxxxxxxx
http://www.ibm.com/linux/ltc
