Re: Major 2.6.38 / 2.6.39 / 3.0 regression ignored?

From: Kirill Smelkov
Date: Fri Jul 22 2011 - 07:09:31 EST

Next message: Kamalesh Babulal: "Re: [patch 03/18] sched: introduce primitives to account for CFSbandwidth tracking"
Previous message: Kamalesh Babulal: "Re: [patch 01/18] sched: (fixlet) dont update shares twice on on_rqparent"
In reply to: Linus Torvalds: "Linux 3.0 release"
Next in thread: Herbert Xu: "Re: Major 2.6.38 / 2.6.39 / 3.0 regression ignored?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

[ Cc'ing Florian Mickler and Keith Packard ]

On Tue, Jul 12, 2011 at 09:07:47PM +0300, Pekka Enberg wrote:
> On Tue, Jul 12, 2011 at 8:17 PM, Kirill Smelkov <kirr@xxxxxxxxxx> wrote:
> > On Sat, May 28, 2011 at 05:19:20PM +0400, Kirill Smelkov wrote:
> >> Hello Chris, everyone,
> >>
> >> On Sat, May 21, 2011 at 04:40:17PM +0100, Chris Wilson wrote:
> >> > On Sat, 21 May 2011 11:23:53 -0400, "Luke-Jr" <luke@xxxxxxxxxx> wrote:
> >> > > On Saturday, May 21, 2011 4:41:45 AM Chris Wilson wrote:
> >> > > > On Fri, 20 May 2011 11:08:56 -0700, Ray Lee <ray-lk@xxxxxxxxxxxxx> wrote:
> >> > > > > [ Adding Chris Wilson (author of the problematic patch) and Rafael
> >> > > > > Wysocki to the message ]
> >> > > > >
> >> > > > > On Fri, May 20, 2011 at 10:06 AM, Luke-Jr <luke@xxxxxxxxxx> wrote:
> >> > > > > > I submitted https://bugzilla.kernel.org/show_bug.cgi?id=33662 a month
> >> > > > > > ago against 2.6.38. Now 2.6.39 was just released without the
> >> > > > > > regression being addressed. This bug makes the system unusable... Some
> >> > > > > > guys on IRC suggested I
> >> > > > > > email, so here it is.
> >> > > > >
> >> > > > > See the bugzilla entry for the bisection history.
> >> > > >
> >> > > > Which has nothing to do with Luke's bug. Considering the thousand things
> >> > > > that can go wrong during X starting, without a hint as to which it is nigh
> >> > > > on impossible to debug except by trial and error. If you set up
> >> > > > netconsole, does the kernel emit an OOPS with it's last dying breath?
> >> > >
> >> > > Why assume it's a different bug? I would almost wonder if it might affect
> >> > > all Sandy Bridge GPUs. In any case, I no longer have the original
> >> > > motherboard (it was recalled, as I said in the first post), nor even the
> >> > > revision of it (it had other issues that weren't being fixed). I *assume* I
> >> > > will have the same problem with my new motherboard (Intel DQ67SW), but I
> >> > > haven't verified that yet. I'll be sure to try a netconsole when I have to
> >> > > reboot next and get a chance to try the most recent 2.6.38 and .39 kernels,
> >> > > but at the moment it seems reasonable to address the problem bisected in the
> >> > > bug, even if it turns out to be different.
> >> >
> >> > The bisection is into an old DRI1 bug on 945GM. That DRI has inadequate
> >> > locking between release and IRQ and so is prone to such races as befell
> >> > Kirill should not surprise anyone. As neither UMS nor DRI supported SNB,
> >> > I can quite confidently state they are separate bugs.
> >> > -Chris
> >>
> >> I see DRI1 is maybe buggy and old, but still, pre-kms X used to work ok
> >> on kernels < 2.6.38, and starting from 2.6.38 the system is just
> >> unusable because X either crashes the kernel (2.6.38), or does not start
> >> at all (2.6.39):
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=36052
> >>
> >>
> >> It's a regression. It's blocking me to upgrade to newer kernels. I've
> >> done my homework -- digged it and came with detailed OOPS on netconsole
> >> and bisected to single commit. Could this please be fixed?
> >
> > Silence...
> >
> > Still, reverting the bisected patch helps even for 3.0:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=36052#c4
>
> Keith, Chris, what's up with this regression from 2.6.38? It seems
> commit e8616b6 ("drm/i915: Initialise ring vfuncs for old DRI paths")
> caused problems on other machines.

Silence again, and not surprising -- I was ringing this bell for 3
months already:

https://bugzilla.kernel.org/show_bug.cgi?id=33662#c10
https://bugzilla.kernel.org/show_bug.cgi?id=36052
(and on the list)

with detailed logs and bisected single patch, without even single reply
from intel-gfx people.

And now after v3.0 is out, I've tested it again, and yes, like it was
broken on v3.0-rc5, it is (now even more) broken on v3.0 -- after first
bad io access the system freezes completely:

On netconsole:

# X starts here, then

[ 45.102377] ------------[ cut here ]------------
[ 45.102402] WARNING: at lib/iomap.c:43 bad_io_access+0x3d/0x40()
[ 45.102411] Hardware name: PCISA-945GSE
[ 45.102418] Bad IO access at port 0x84 (return inl(port))
[ 45.102425] Modules linked in:
[ 45.102438] Pid: 2846, comm: sshd Not tainted 3.0.0--NAVY #33
[ 45.102445] Call Trace:
[ 45.102460] [<c118e9fd>] ? bad_io_access+0x3d/0x40
[ 45.102473] [<c10287ec>] warn_slowpath_common+0x6c/0xa0
[ 45.102484] [<c118e9fd>] ? bad_io_access+0x3d/0x40
[ 45.102495] [<c102889e>] warn_slowpath_fmt+0x2e/0x30
[ 45.102506] [<c118e9fd>] bad_io_access+0x3d/0x40
[ 45.102516] [<c118edb2>] ioread32+0x22/0x40
[ 45.102528] [<c122cc7d>] i915_driver_irq_handler+0x1ad/0x660
[ 45.102541] [<c12c6a7e>] ? rtl8169_interrupt+0xee/0x370
[ 45.102554] [<c105c396>] handle_irq_event_percpu+0x36/0x140
[ 45.102565] [<c105e490>] ? handle_edge_irq+0x150/0x150
[ 45.102576] [<c105c4d9>] handle_irq_event+0x39/0x60
[ 45.102587] [<c105e4d5>] handle_fasteoi_irq+0x45/0xd0
[ 45.102594] <IRQ> [<c1003c29>] ? do_IRQ+0x39/0xb0
[ 45.102613] [<c103c9b3>] ? start_flush_work+0xc3/0x130
[ 45.102625] [<c13bc329>] ? common_interrupt+0x29/0x30
[ 45.102636] [<c13bc329>] ? common_interrupt+0x29/0x30
[ 45.102648] [<c11e007b>] ? pnpacpi_encode_resources+0x37b/0x7a0
[ 45.102659] [<c109971e>] ? fget_light+0xe/0xf0
[ 45.102671] [<c10a8f97>] ? do_select+0x2e7/0x680
[ 45.102685] [<c1341998>] ? sch_direct_xmit+0x58/0x1d0
[ 45.102695] [<c10a83e0>] ? poll_freewait+0xa0/0xa0
[ 45.102706] [<c102df37>] ? local_bh_enable+0x47/0xa0
[ 45.102718] [<c132e371>] ? dev_queue_xmit+0x101/0x4e0
[ 45.102729] [<c134ffba>] ? ip_finish_output+0x10a/0x2f0
[ 45.102740] [<c1350216>] ? ip_output+0x76/0x90
[ 45.102750] [<c134d715>] ? ip_local_out+0x65/0x70
[ 45.102762] [<c134fa3d>] ? ip_queue_xmit+0x1bd/0x3b0
[ 45.102775] [<c1362af8>] ? tcp_transmit_skb+0x468/0x7d0
[ 45.102788] [<c13215af>] ? sk_reset_timer+0xf/0x20
[ 45.102798] [<c1362446>] ? tcp_event_new_data_sent+0x86/0xc0
[ 45.102809] [<c1364fc1>] ? tcp_write_xmit+0x1e1/0x9a0
[ 45.102822] [<c1326925>] ? __alloc_skb+0x55/0x100
[ 45.102838] [<c102df37>] ? local_bh_enable+0x47/0xa0
[ 45.102849] [<c1321246>] ? release_sock+0xd6/0x110
[ 45.102859] [<c13657f7>] ? __tcp_push_pending_frames+0x27/0x80
[ 45.102870] [<c13584fa>] ? tcp_sendmsg+0x64a/0xac0

-*- and then system FREEZE -*-

For completeness `X -verbose` log is in "Appendix 1", (but who cares
anyway? I've sent lots of such logs without a reply).

And again, after reverting e8616b6 ("drm/i915: Initialise ring vfuncs
for old DRI paths") on top of v3.0, X works without any problem again.

So I wonder:

I thought people are trying to do "no regressions" rule in kernel.
Should we then just apply the following patch? In case Intel people
are not responding, should it just go directly into mainline?

Or would it be more fair to say that UMS is not supported anymore,
is broken and just remove support for it?

Thanks,
Kirill

P.S. Sometimes people change their hardware preferences based on
software support quality. Knock, knock...

Next message: Kamalesh Babulal: "Re: [patch 03/18] sched: introduce primitives to account for CFSbandwidth tracking"
Previous message: Kamalesh Babulal: "Re: [patch 01/18] sched: (fixlet) dont update shares twice on on_rqparent"
In reply to: Linus Torvalds: "Linux 3.0 release"
Next in thread: Herbert Xu: "Re: Major 2.6.38 / 2.6.39 / 3.0 regression ignored?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]