Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?

From: David Brownell
Date: Sat Mar 06 2004 - 00:00:23 EST


David Mosberger wrote:
OK, finally a bit of progress. If you remember back in October 2003 I
reported:

> One-line summary: plug-in your USB keyboard, see your machine die.

> So, I have this non-name USB keyboard (with built-in 2-port USB
> hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.
> In retrospect, it's clear to me that the same keyboard also
> occasionally crashes 2.4 kernels, but there the problem appears
> more seldom.

Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines
I tested would crash as soon as they had _any_ USB keyboard plugged
in. That is, the problem no longer was limited to the BTC keyboard,
which is special because it has a built-in hub. This was encouraging.

Turns out it's this patch that was causing the crashes:

http://linux.bkbits.net:8080/linux-2.5/cset@xxxxxxxxxxx

Maybe in 2.6.4-rc2... but not in 2.6.0-test{8,9}!!

There's something wierd going on recently for sure, and it's
not caused just by that patch. I'm not sure reverting that
would make things better overall right now; hard to say.

I'm thinking the "disable periodic schedule" patch is also
worth looking at. I can imagine some silicon having bugs
if that's turned off (just like some _systems_ have issues
when it's left on).


That was strange, because even to my USB-untrained eye the patch
looked obviously correct. However, I think the root cause of the
problem really has to do with a race-condition between the controller
and the driver. In particular, if I apply the patch below, my USB
keyboards (including the BTC keyboard) work just fine!

...
- ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
+ ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;

I've seen that _change_ behavior in some regression tests
(usbtest test11/test12), as if the extra msec let things
quiesce (so only one of two broken states showed, and
not the oopsable one) but not _fix_ it.


However, I think the root-cause of the problem may be this optimization
in ohci_irq():

/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */

Indeed, if I apply this patch instead:

...
/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
- if ((ohci->hcca->done_head != 0)
+ if (0 && (ohci->hcca->done_head != 0)
...

there are no crashes either.

See this post from Martin Diehl ... my response isn't out yet:

http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107850825815775&w=2

The reason I keep ending up thinking that readl-elimination
must be OK (me agreeing with Martin) is that the memory there
came from dma_alloc_coherent() ... so if anything's wrong,
it'd be at most lack of rmb(), not a stale-cache kind of thing.


So my theory is that I was seeing this sequence of events:

- HCD signals WDH interrupt & sends DMA to update the frame number in
the host-controller communication area (HCCA)

- host gets interrupt, but skips readl() and hence reads a stale
frame number N instead of the up-to-date value (N+1)

It reads the frame number directly from the controller, so it's
not possible that it can be so stale that an rmb() wouldn't fix
sequencing issues.

What might be possible though is that the donelist gets modified
by the time the unlinks get processed, with some extra TDs changing
state (from HC perspective) ... haven't explored that possibility.


- HCD cancels a transfer descriptor (TD), moves it to the "remove list"
and calculates the frame number at which it can be remove from
the host-controller's list as N+1

That'd be an ED on the remove list, not a TD. Also in dma-coherent
memory. The cancelation would apply to one or more of the TDs
queued to that ED.


- SOF interrupt arrives (probably was pending already?)

- interrupt handler does a readl() and now sees the updated
frame-number N+1

- HCD sees that the cancelled TD's time stamp N+1 is <= the current
current time stamp (N+1) and goes ahead and removes it from the
host-list, while the controller is still looking at the entry being
removed

This ED-level disagreement between the HC and HCD might explain
some issues. I think the current trouble cases are usually where
there's only one TD queued to the ED, so that TD and the dummy
keep ping-ponging back and forth. The HC certainly seems to
overwrite the "dummy TD" --

- HCD ends up dereferencing a bad pointer and ends up reading from
address 0xf0000000, which on our ia64 machines is a read-only area,
which then results in a machine-check abort

I'm surprised that DMA from a read-only area would be a problem! :)
If OHCI is getting a PCI error, I'd expect a "UE" IRQ.


Does this sound plausible?

Parts of it. There's definite recent nastiness. Of the type
that other eyes sometimes see better.


What beats me is why UHCI would have the same issue. I know even less
about UHCI than I do about OHCI but perhaps there is a similar
problem.

I still suspect some problem in the HID code. But right now
I'm quite certain of a recent-ish OHCI issue.

- Dave



--david



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/