Re: Semaphore assembly-code bug

From: Jeff Garzik
Date: Fri Oct 29 2004 - 17:29:12 EST


Linus Torvalds wrote:

popl %eax
popl %ecx

should one cycle on a Pentium. I pretty much _guarantee_ that

lea 4(%esp),%esp
popl %ecx

takes longer, since they have a data dependency on %esp that is hard to break (the P4 trace-cache _may_ be able to break it, but the only CPU that I think is likely to break it is actually the Transmeta CPU's, which did that kind of thing by default and _will_ parallelise the two, and even combine the stack offsetting into one single micro-op).


One of my favorite "optimizing for Pentium" docs is

http://www.agner.org/assem/pentopt.pdf
from
http://www.agner.org/assem/

which is current through newer P4's AFAICS.

It notes on the P4 specifically that LEA is split into additions and shifts. Not sure what it does on the P3, but I bet it generates more uops in addition to the data dependency.

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/