It's also possible that we are seeing changes that interact with his
machine's cache, memory, and CPU implementation. Depending on the
caching algorithms involved, rearranging a few bytes of critical kernel
code here and there might make a 0.1% difference in overall performance.
Such issues become quite obvious when you start comparing inlined vs.
conventional functions, rolled vs. unrolled loops, etc. Large OS vendors
know this--they have tools that do nothing but rearrange code so that it
"fits" on the hardware better.
To do these tests properly would require several different types of
machines, with different CPUs (remember that Pentiums have
counterintuitively low BogoMips ratings), bus architectures, memory,
disks... Then there are the possible effects of the compiler version,
libc, and other components of the software run-time environment to
consider.
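For anyone who wants to try the rolled vs. unrolled comparison on their
own box, here is a rough sketch in C. It is my own illustration, not
anything from the kernel tree: the names sum_rolled, sum_unrolled, and
time_sum are made up, and an optimizing compiler may well unroll the
first loop by itself. Treat any timing difference as specific to one
machine, one compiler, and one set of flags.

```c
#include <stddef.h>
#include <time.h>

/* Rolled version: one add and one branch per element. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled by four: fewer branches, but more code bytes
 * competing for space in the instruction cache. */
long sum_unrolled(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)          /* leftover elements */
        s += a[i];
    return s;
}

/* Call f over the array `reps` times. Returns the CPU time used in
 * seconds and stores the accumulated sum in *out so the work cannot
 * be discarded as dead code. */
double time_sum(long (*f)(const int *, size_t),
                const int *a, size_t n, int reps, long *out)
{
    clock_t t0 = clock();
    long s = 0;
    for (int r = 0; r < reps; r++)
        s += f(a, n);
    *out = s;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_sum() with each function over the same large array, and
then repeating that on a second machine or with a different compiler
version, is usually enough to see how unstable such small differences
are.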
By benchmarking someone's unknown hardware, we can hack up GCC and
friends to rearrange the kernel binary to run maybe 2% faster on it...
Since I generally favor correctness over performance, I say we fix the
bugs first. ;-)