OK, finally a bit of progress. If you remember back in October 2003 I
reported:
> One-line summary: plug-in your USB keyboard, see your machine die.
> So, I have this non-name USB keyboard (with built-in 2-port USB
> hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.
> In retrospect, it's clear to me that the same keyboard also
> occasionally crashes 2.4 kernels, but there the problem appears
> more seldom.
Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines
I tested would crash as soon as they had _any_ USB keyboard plugged
in. That is, the problem was no longer limited to the BTC keyboard,
which is special because it has a built-in hub. This was encouraging,
since it made the crash much easier to reproduce.
Turns out it's this patch that was causing the crashes:
http://linux.bkbits.net:8080/linux-2.5/cset@xxxxxxxxxxx
That was strange, because even to my USB-untrained eye the patch
looked obviously correct. However, I think the problem really comes
down to a race condition between the controller and the driver. In
particular, if I apply the patch below, my USB keyboards (including
the BTC keyboard) work just fine!
...
- ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
+ ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;
More precisely, I think the root cause of the problem may be this
optimization in ohci_irq():
/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
Indeed, if I apply this patch instead:
...
/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
- if ((ohci->hcca->done_head != 0)
+ if (0 && (ohci->hcca->done_head != 0)
...
there are no crashes either.
So my theory is that I was seeing this sequence of events:
- HC signals WDH interrupt & does DMA to update the frame number in
the host-controller communication area (HCCA)
- host gets interrupt, but skips readl() and hence reads a stale
frame number N instead of the up-to-date value (N+1)
- HCD cancels a transfer descriptor (TD), moves it to the "remove list"
and calculates the frame number at which it can be removed from
the host-controller's list as N+1
- SOF interrupt arrives (probably was pending already?)
- interrupt handler does a readl() and now sees the updated
frame-number N+1
- HCD sees that the cancelled TD's time stamp N+1 is <= the
current time stamp (N+1) and goes ahead and removes it from the
host-list, while the controller is still looking at the entry being
removed
- HCD ends up dereferencing a bad pointer and ends up reading from
address 0xf0000000, which on our ia64 machines is a read-only area,
which then results in a machine-check abort
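
To make the arithmetic in this sequence concrete, here is a minimal C
sketch of the off-by-one. It is a toy model, not driver code: the names
frame_before() and removal_is_premature() are made up for illustration,
and it assumes that a removal is safe only if the removal tick lies
strictly after the frame the controller was really in when the entry
was unlinked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names exist in the OHCI driver. */

/* Wraparound-safe test: is 16-bit frame number a strictly before b? */
bool frame_before(uint16_t a, uint16_t b)
{
    return (int16_t)(uint16_t)(a - b) < 0;
}

/*
 * true_frame: frame the controller is really in when the entry is unlinked
 * stale_by:   how many frames the driver's view lags behind (0 = up to date)
 * delay:      offset the driver adds to its view to get the removal tick
 *
 * Returns true if the driver would free the entry too early.
 */
bool removal_is_premature(uint16_t true_frame, uint16_t stale_by,
                          uint16_t delay)
{
    uint16_t observed = (uint16_t)(true_frame - stale_by); /* driver's view */
    uint16_t tick = (uint16_t)(observed + delay);          /* removal tick */

    /*
     * Assumption of this model: the driver frees the entry as soon as the
     * current frame reaches tick, so it is safe only if tick lies strictly
     * after true_frame.
     */
    return !frame_before(true_frame, tick);
}
```

With the controller really in frame N+1 and the driver's view one frame
stale, removal_is_premature(N+1, 1, 1) is true, while a delay of 2 (the
first patch) or an up-to-date view (the second patch) makes it false.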
Does this sound plausible?
What beats me is why UHCI would have the same issue. I know even less
about UHCI than I do about OHCI but perhaps there is a similar
problem.
--david