I wonder if you're getting TLB thrashing? Does the software touch a
wide range of addresses? It could be that the Win 95 driver is using a
single 4MB page mapping, while under Linux it's getting a whole pile of
4KB pages, each of which will need its own TLB entry. If your memory
accesses bounce all over the place in virtual memory, you could be
thrashing the mappings in your TLB. Is it possible to use the 4MB page
extension for mapping the hardware into a process address space? Does X
use it for mapping in video cards?
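
To put rough numbers on the page-size argument, here is a minimal sketch in C. The helper names are made up for illustration, and the 4MB aperture and 64-entry TLB used in the note below are assumed figures, not measurements of your hardware.

```c
#include <stddef.h>

/* Hypothetical helper: number of pages (and so distinct TLB entries)
 * needed to cover `bytes` of address space with `page_size`-byte pages.
 * Rounds up, because a partly used page still needs a whole entry. */
size_t tlb_entries_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Hypothetical helper: bytes of address space that `entries` TLB
 * entries can map at the same time (the "TLB reach"). */
size_t tlb_reach(size_t entries, size_t page_size)
{
    return entries * page_size;
}
```

For an assumed 4MB aperture, tlb_entries_needed(4UL << 20, 4096) gives 1024 entries with 4KB pages, against 1 entry with a single 4MB page. If the TLB held 64 entries (again an assumption), tlb_reach(64, 4096) is only 256KB, so accesses scattered across the whole aperture would keep evicting each other's mappings.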
On the other hand, if it is using 4MB pages, it could do it with the
plain Pentium too, so you'd see a similar speed effect there. Except
that servicing a TLB-miss might be relatively more expensive on a
PPro/PII - this would be consistent with your Pentium being a little
slower, but the PPro/PII being much slower.

I gather the PII/PPro has a comprehensive set of registers for
monitoring performance problems like this, including a count of TLB
misses, cache misses, pipeline bubbles and so on. Maybe you can use
them to get a grip on what's happening. I think the Pentium only has
a cycle counter, but that would at least tell you whether your code is
taking longer than it should (though you know that already, I guess).
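
As a rough way to try the cycle-counter idea, here is a sketch in C that times a strided walk over a buffer. It assumes GCC or Clang on x86, where __rdtsc() from <x86intrin.h> reads the time-stamp counter; the function name and parameters are invented for illustration. Note that on newer CPUs the counter may tick at a fixed rate rather than in actual core cycles.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(): GCC/Clang, x86 only */

/* Hypothetical helper: touch one byte every `stride` bytes across `buf`,
 * `passes` times over, and return the elapsed time-stamp-counter ticks.
 * `volatile` keeps the compiler from optimising the accesses away.
 * Returns 0 without touching anything if `stride` is 0. */
uint64_t time_strided_touch(volatile unsigned char *buf, size_t len,
                            size_t stride, unsigned passes)
{
    if (stride == 0)
        return 0;

    uint64_t start = __rdtsc();
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < len; i += stride)
            buf[i]++;
    return __rdtsc() - start;
}
```

Calling it on the mapped region once with a small stride (say 64) and once with a stride of 4096, then dividing each result by the number of accesses, gives a crude per-access cost. If the page-sized stride is far more expensive per access, that would be consistent with TLB misses, though cache misses would show up the same way, so treat it as a hint rather than proof.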
J