RE: Panic at _blk_run_queue on 2.6.32

From: Rich, Jason
Date: Fri Jul 19 2013 - 10:40:05 EST


> -----Original Message-----
> From: linux-kernel-owner@xxxxxxxxxxxxxxx [mailto:linux-kernel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Rich, Jason
> Sent: Monday, July 15, 2013 9:10 AM
> To: Willy Tarreau
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Subject: RE: Panic at _blk_run_queue on 2.6.32
>
> > -----Original Message-----
> > From: linux-kernel-owner@xxxxxxxxxxxxxxx [mailto:linux-kernel-
> > owner@xxxxxxxxxxxxxxx] On Behalf Of Willy Tarreau
> > Sent: Wednesday, July 10, 2013 3:27 PM
> > To: Rich, Jason
> > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > Subject: Re: Panic at _blk_run_queue on 2.6.32
> >
> > Hi Jason,
> >
> > On Tue, Jul 09, 2013 at 05:42:29PM +0000, Rich, Jason wrote:
> > > Greetings,
> > > I've recently encountered an issue where multiple hosts are failing
> > > to boot up about 1/5 of the time. So far I have confirmed this
> > > issue on three separate host machines. The issue presents itself
> > > after updating 2.6.32.39 to patch 50 and patch 61.
> > > Both patch levels result in the failure described below. Since this
> > > occurs on multiple hosts, I feel I can safely rule out hardware.
> >
> > First, thank you for your very detailed report. Do you think you could
> > narrow this down to a specific kernel version ? Given that there are
> > exactly 10 versions between .39 and .50, I think that a version-level
> > bisect would take 3 or 4 builds (so probably around 20 reboots).
>
> I was out of town for a little while there, but I plan to do just that in a little
> while. I will let you know what I find. Hopefully it won't take me too terribly
> long.
>
> >
> > It would help us spot the faulty patch. Right now, there are 546
> > patches between .39 and .50 so it's quite hard to find the culprit,
> > even with your full trace. That does not mean we'll immediately spot
> > it, maybe a deeper bisect will be needed, but it should be easier.
> >
> > > It is also of note that I have not seen this behavior on the 3.4.26
> > > kernel, or on any of my 32bit hosts.
> >
> > This is good news, because we're probably missing one fix from a
> > more recent version that addressed a similar regression and that we
> > might backport into 2.6.32.62.
> >
> > > That said, I have to support this software release (which runs on
> > > the 2.6 kernel) for at least another two years.
> >
> > Be careful on this point, 2.6.32 is planned for EOL next year :
> >
> > https://www.kernel.org/category/releases.html
> >
> > You might want to consider migrating to a supported distro kernel or
> > to 3.2 instead. That said, if you follow carefully the updates from
> > later kernels, you might prefer to maintain your own backports of the
> > patches that are relevant to your usage.
>
> Thanks, we already have pulled in 3.4 to our released product, but I still have
> to support my product's previous releases for a time. My goal is to patch up
> to .61 plus a fix to this issue and never touch the release again. Worst case,
> I'll stay on 2.6.32.39 and cherry pick. I'd really hate to do that, however.
> Anyway, as stated earlier, I'll bisect and try to narrow this down. Appreciate
> the help so far and really hope we just have to back patch a fix.
>
> Jason
> >
> > Best regards,
> > Willy
> >

Just a small update from this week of trying to narrow it down. Long story short, I've gotten about 3 bisect steps in. The failures are appearing less often than I previously saw on these two particular machines; it feels like maybe 1 in 40 reboots. In any case, declaring a kernel revision "good" will require me to run my test at least overnight to be sure. My test simply reboots the system every 5 minutes. When it crashes, I have a terminal window open that shows it hung.
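
As a rough sanity check on the "overnight" figure, here is a minimal sketch of how many clean reboots it takes before a revision can reasonably be marked "good". It assumes each boot of a bad kernel fails independently with a fixed probability, which is a simplification; the function names and the 95% confidence level are illustrative choices, not anything taken from the test setup.

```python
import math


def reboots_needed(fail_rate, confidence=0.95):
    """Clean reboots required before calling a revision 'good'.

    Assumption (illustrative only): every boot of a bad kernel fails
    independently with probability fail_rate.
    """
    if not 0.0 < fail_rate < 1.0:
        raise ValueError("fail_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # A bad kernel survives n boots with probability (1 - fail_rate) ** n.
    # Find the smallest n that pushes that below (1 - confidence).
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_rate))


def hours_for(reboots, minutes_per_reboot=5):
    """Wall-clock hours for a run of reboots at a fixed reboot interval."""
    return reboots * minutes_per_reboot / 60.0


if __name__ == "__main__":
    for rate in (1 / 5, 1 / 40):
        n = reboots_needed(rate)
        print(f"fail rate {rate:.3f}: {n} reboots, about {hours_for(n):.1f} hours")
```

Under those assumptions, a 1-in-40 failure rate needs on the order of 120 clean reboots per bisect step, which at one reboot every 5 minutes is roughly ten hours. The original 1-in-5 rate would have needed only about 14.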
In case you are actively poking around, I've ruled out quite a bit so far. If I understand bisect correctly (this is my first time using it, actually), it has taken me below the 2.6.32.42 tag.
Bisect log:
# bad: [60b1e4f20a6cf45f07d2aef7eecd7fd58007ff1e] Linux 2.6.32.50
# good: [145fff1f0b75c8bd6a26052d638276bb2e009983] Linux 2.6.32.39
git bisect start 'v2.6.32.50' 'v2.6.32.39'
# bad: [1ff36a0e02f978e533b13ce6a86ad6a73444cec8] cfq-iosched: fix locking around ioc->ioc_data assignment
git bisect bad 1ff36a0e02f978e533b13ce6a86ad6a73444cec8
# bad: [1183c16343f6daff3e418f8c782ce924f52ae148] tehuti: Firmware filename is tehuti/bdx.bin
git bisect bad 1183c16343f6daff3e418f8c782ce924f52ae148
# bad: [0ec1c448546ccd6413dd864bf007a13a3af4c7c4] SUNRPC: fix NFS client over TCP hangs due to packet loss (Bug 16494)
git bisect bad 0ec1c448546ccd6413dd864bf007a13a3af4c7c4
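
For a rough idea of how much bisecting is left, the sketch below estimates the number of steps from the size of the candidate range. It assumes the range halves exactly at each step, which is only approximately what git does on a real history, and the helper names are hypothetical. The 10 versions and 546 patches are the figures quoted earlier in the thread.

```python
def bisect_steps(candidates):
    """Worst-case number of halving steps to isolate one of `candidates` items."""
    if candidates < 1:
        raise ValueError("candidates must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (candidates - 1).bit_length()


def steps_remaining(candidates, steps_done):
    """Worst-case steps left, assuming each finished step halved the range."""
    left = -(-candidates // (2 ** steps_done))  # ceiling division
    return bisect_steps(left)


if __name__ == "__main__":
    print("version-level, 10 versions:", bisect_steps(10))
    print("commit-level, 546 patches:", bisect_steps(546))
    print("commit-level after 3 steps:", steps_remaining(546, 3))
```

With those numbers, a commit-level bisect over 546 patches takes about 10 steps in the worst case, so roughly 7 remain after the 3 recorded above.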

> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-kernel" in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
