Re: v2.6.21.5-rt19 (sched_getaffinity?)

From: Fernando Lopez-Lezcano
Date: Mon Jul 09 2007 - 01:08:41 EST


On Sun, 2007-07-08 at 20:53 -0700, Fernando Pablo Lopez-Lezcano wrote:
> On Mon, 9 Jul 2007, Gabriel C wrote:
> > Fernando Lopez-Lezcano wrote:
> >> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
> >>> * Fernando Lopez-Lezcano <nando@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>>> Changes since 2.6.21.5-rt18:
> >>>>>> - Fixed a nasty and hard to track down slowness / boot problem on SMP
> >>>>>> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> >>>>>> wheel base lock held during the get_next_timer_interrupt() call in the
> >>>>>> idle path, which eventually led to a bogus PI boosting of the idle task
> >>>>>> and in consequence a stale wrong scheduler selection for the affected
> >>>>>> idle
> >>>>>> task.
> >>>>>>
> >>>>>> Kudos to Carsten Emde, who patiently and meticulously isolated the
> >>>>>> problem and provided the traces, which allowed to identify the root
> >>>>>> cause.
> >>>>>>
> >>>>>> Problem solution: Prevent idle task boosting
> >>>>>>
> >>>>> Maybe someone remember me whining about troubles with 2.6.21-rt2..18 on
> >>>>> my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> >>>>>
> >>>>> Althought I'm still with my fingers crossed, I can tell the good news
> >>>>> are that 2.6.21.5-rt19 (and -rt20) does behave far better now on the
> >>>>> very same box.
> >>>>>
> >>>> Yes, it works much better indeed...
> >>>>
> >>>> Ingo: is there a place where I can read about the changes in different
> >>>> rtxx releases? What is new/better/fixed in rt20? (I see scheduler stuff
> >>>> in a diff from rt19 to rt20 but I don't really know what it means).
> >>>>
> >>> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when CFS
> >>> was merged.
> >>>
> >>> i _think_ Rui might have seen two separate problems. Perhaps by the time
> >>> we fixed the first problem (which Rui saw since -rt2) we introduced the
> >>> other one via -rt11 - which then got fixed in -rt19.
> >>
> >> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
> >> really trying to provide a "stable" rt kernel for audio usage and
> >> including another subsystem into rt is - IMHO - not going to help.
> >> What's the chance of splitting things?
> >>
> >>> btw., we'd love to get more feedback regarding CFS. CFS is a completely
> >>> new scheduler for Linux.
> >>
> >> Then I'd rather have it separate from rt.
> >>
> >>> It has a design centered around keeping application latencies down, so it
> >>> is ultimately real-time friendly, and it should also make things work
> >>> better for desktop-ish and audio-ish stuff as well. (even under
> >>> SCHED_OTHER)
> >>>
> >>
> >> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
> >> list):
> >>
> >> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
> >>
> >>> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
> >>> on my desktop but on my laptop it makes Firefox and Tomboy to crash.
> >>> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
> >>> problem.
> >>>
> >> I managed to completely hang firefox (fc7) with flash 9 installed
> >> (unkillable even with -9).
> >
> > Firefox with flash 9 does not work good , there are a lot bugs reported
> > about ( just google ) and it hangs on vanilla or whatever other kernels
> > as well. Not only Firefox but also Swiftfox, Opera, Epiphany etc.
> >
> > The most time Firefox dies when you use flash 9 and close a window or a
> > tab.
>
> More tests...
>
> The problem is the rt kernel AFAICT, this goes beyond Flash 9, way
> beyond:
>
> _OpenOffice_ hangs with 2.6.21.5-rt20, works fine with stock Fedora 7
> kernel. Flash 9 hangs with 2.6.21.5-rt20, works fine with the stock Fedora
> 7 kernel. Same machine booting different kernels, I'd say it is the
> kernel.
>
> The only way out for a hung app is a reboot.
>
> Ingo: what would be a good way to trace this? It makes the rt kernels not
> very usable at least on this hardware (more tests tomorrow in the CCRMA
> machines).
>
> Same on 2.6.21.5-rt18 with CONFIG_NO_HZ not set.

I forgot to include the output of strace... and of course now I can't
repeat the openoffice hang.

I do get flash 9 (I know, not the best example) and tomboy to hang as
reported by one of my Planet CCRMA users - flash 9 tested working on
stock fedora 7 kernel - and both seem to hang in the same system call:

sched_getaffinity(3528, 32, <unfinished ...>

Full output of strace attached for both cases.

Hopefully this will make the bug immediately obvious to someone :-)
[running on a laptop with the 7700 Intel cpu]
-- Fernando

Attachment: firefox.trace.gz
Description: GNU Zip compressed data

Attachment: tomboy.trace.gz
Description: GNU Zip compressed data