[Fwd: PIII patches]

dledford@redhat.com
Wed, 09 Jun 1999 01:48:32 -0400


-------- Original Message --------
From: dledford@redhat.com
Subject: PIII patches
To: Linus Torvalds <torvalds@transmeta.com>, MOLNAR Ingo
<mingo@chiara.csoma.elte.hu>, zab@redhat.com, Dave Miller
<davem@redhat.com>, "Stephen C. Tweedie" <sct@redhat.com>

These are my current PIII patches. The first patch is the base PIII/FXSAVE
support. It was done along the lines of what Linus suggested (and I agreed
with), in that we don't fiddle around doing FXSAVE to FSAVE conversions. We
use fnstenv and fldenv to get the various control bits that we need, and then
simply copy the 80-bit fpu register values into the user_i387 struct.
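For anyone who hasn't looked at the two save formats, the register copy boils
down to something like the standalone C sketch below. This is not code from
the patch, and it only shows the register copy, not the fnstenv/fldenv
handling of the control bits; the layout assumptions are that FXSAVE keeps
each x87 register in a 16 byte slot with the low 10 bytes valid, while the
user_i387 image packs the eight 80-bit registers back to back.

#include <string.h>

/* Rough sketch only: copy the eight 80-bit ST register images out of
 * an FXSAVE-style buffer into a packed FSAVE-style (user_i387) buffer. */
static void copy_st_regs(void *user_st_space, const void *fxsave_st_space)
{
	int i;

	for (i = 0; i < 8; i++)
		memcpy((char *)user_st_space + i * 10,         /* 10 bytes per reg, packed */
		       (const char *)fxsave_st_space + i * 16, /* 16 byte slots in FXSAVE */
		       10);
}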

The first patch also includes a few PIII/XMM optimized functions to speed up
kernel operations. These are compile time options, but once selected at
compile time they still carry a runtime burden. It is impossible to get around
that runtime burden without doing two things: 1) change all instances of
memcpy(), memset(), copy_*_user(), and any other XMM optimized functions that
run before the cr4 register is set up on the CPU so that they don't use the
XMM version, and 2) require that the XMM optimized kernel only be run on PIII
and higher CPUs. I didn't want to tackle the work involved in #1 (or the
future work, since people will violate that rule at some point and machines
will spontaneously reboot or oops), and I don't personally like #2. Besides,
XMM optimized functions are a loss below a certain operation size, so you have
to incur some burden to check the data size anyway; getting rid of the runtime
burden listed above is a false optimization in my opinion.

In this patch I did optimizations for the memset(), memcpy(), copy_to_user(),
and copy_from_user() functions. Someone could still do the various TCP/IP
checksum routines as XMM optimized functions. Someone (Stephen, you mentioned
this) could add an idle-thread page zeroing option on PIII CPUs, since page
zeroing is now non-cache polluting. There are other things that could be done
as well. I'm interested to find out what people think of this, and also
whether people could get some real-world benchmarks of how this code performs.

Caveats: This patch doesn't set the OSXMMEXCEPT bit in cr4 since we don't
have the new exception handler yet. That still needs to be written. Once it
is, the various tests that currently look something like

if ( (len > 95) &&
     (boot_cpu_data.x86_capability & X86_FEATURE_XMM) &&
     (boot_cpu_data.cr4_features & X86_CR4_OSFXSR) ) {
}

can be compressed to test just the len and the X86_CR4_OSXMMEXCEPT feature.
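
For illustration, once the exception handler exists and the bit is set, the
check might collapse to something like this (hypothetical; it assumes the same
cr4_features field used above plus an X86_CR4_OSXMMEXCEPT define):

if ( (len > 95) &&
     (boot_cpu_data.cr4_features & X86_CR4_OSXMMEXCEPT) ) {
	/* XMM-optimized path is guaranteed safe here */
}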

The second patch I have attached is against Ingo's raid code (not the stock
kernel raid code). Between the PIII optimizations and the unrolling of the
xor function, I've gotten over a 20 MByte/s improvement in sustained Raid5
throughput on a system where the disks aren't the bottleneck; the CPU and
memory are.
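
To give a feel for what the unrolling buys, here is a plain C sketch with
made-up names (the actual patch does this in MMX/XMM assembly):

/* Illustrative only: xor one block into another, several words per
 * iteration, so the loop overhead and pointer updates are amortized
 * over more useful work. */
static void xor_2_blocks(unsigned long *p1, const unsigned long *p2,
                         unsigned long bytes)
{
	unsigned long lines = bytes / (4 * sizeof(unsigned long));

	while (lines--) {
		p1[0] ^= p2[0];
		p1[1] ^= p2[1];
		p1[2] ^= p2[2];
		p1[3] ^= p2[3];
		p1 += 4;
		p2 += 4;
	}
}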

Caveats on the second patch: it needs the sparc sections updated to also have
unrolled loops :) Ohhhh Daaaave ;-) I freely admit to knowing nothing about
sparc asm, nor caring to learn at this point in time. MAX_XOR_BLOCKS could be
increased on sparc if Dave chose to do so; the current limit is set by the
maximum number of blocks we can handle on Intel in a single pass without
spilling unspillable registers, according to gcc. Second issue: I reordered a
bunch of the p5_mmx routine in my patch, and now the p5_mmx routine is faster
than the pII_mmx routine on my PIII. I'm not sure whether that's because I
made the pII_mmx routine slower, made the p5_mmx routine faster, or whether
they are about the same speed and the p5_mmx routine just has the benefit of
cached blocks. I'm also not sure how it actually does on a p5 :) If someone
could enlighten me on whether this code is faster or slower for the p5_mmx
routine on a p5 MMX CPU, I would appreciate it. Of course, please note that
the speed test done at bootup uses only two blocks; for the 3 to 5 block
cases, this stuff is going to be faster regardless.

So, I welcome any comments. Linus, how do these look to you? Personally, I
would like to stop maintaining them separately, so if you have issues with the
first patch, let me know and I'll see if I can get them fixed. It's currently
against 2.2.9, BTW. Has it been tested? I've been running with the first patch
for over a month now (mostly I've done code reordering and other tweaking in
that time; the base code has been done for a while) and with the second patch
almost as long.

As a note to others who might wish to try XMM optimizations: please keep in
mind that if things like the prefetch are working properly, your results may
not be what you expect. You can reorder the instructions in the kni_xor
function to get a higher bootup benchmark from the raid5 code, but if you do,
you will actually be slowing the routine down. The fact that the code
benchmarks slightly slower than the pII_mmx routine is a side effect of the
cache-avoiding instructions actually working. The higher numbers you get by
reordering the prefetch instructions are a side effect of causing the prefetch
to fail and making the blocks get faulted into the L2 cache. To truly test
code ordering, you need to go into userspace and run tests large enough that
they defeat the cache entirely, so you can see accurate results whether your
prefetch code is hitting or missing the target. For those who don't know, the
prefetchnta instruction and the other prefetch instructions on the PIII are
*very* timing/location sensitive. Moving the instruction up or down one line
in a loop can make the difference between getting it right and causing it to
fail. All of the functions I optimized included testing to find the exact
point at which the prefetch instructions worked best.
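
As a concrete illustration of what placement means here (again just a C
sketch with made-up names and a made-up prefetch distance, not the kni_xor
code):

/* The prefetchnta is issued for data a few iterations ahead; move it
 * earlier or later in the body and the hint either arrives too late or
 * gets thrown away before the data is used.  Prefetches are only hints,
 * so running past the end of the buffer is harmless. */
static void xor_block_prefetch(unsigned long *dst, const unsigned long *src,
                               unsigned long words)
{
	unsigned long i;

	for (i = 0; i + 8 <= words; i += 8) {
		__asm__ __volatile__("prefetchnta %0" : : "m" (src[i + 32]));

		dst[i + 0] ^= src[i + 0];
		dst[i + 1] ^= src[i + 1];
		dst[i + 2] ^= src[i + 2];
		dst[i + 3] ^= src[i + 3];
		dst[i + 4] ^= src[i + 4];
		dst[i + 5] ^= src[i + 5];
		dst[i + 6] ^= src[i + 6];
		dst[i + 7] ^= src[i + 7];
	}
}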

OK, that's it for now. I didn't Cc: this to linux-kernel simply because of
the size; I don't like to spam this much patch bloat through the list. I'll
repost this note there with a pointer to the patch files so that others may
grab them from my site.
-------- End Original Message --------

As per my last remark, the patches can be found at
http://www.redhat.com/~dledford/linux_kernel.html

The first patch I talked about is named PIII.patch and the second one is
PIII-raid.patch. Enjoy.

Note: Don't bother to let me know about broken links on those pages or
anything else; as far as I'm concerned those pages are just a temporary
delivery mechanism and aren't intended to be around long.

-- 
  Doug Ledford   <dledford@redhat.com>
   Opinions expressed are my own, but
      they should be everybody's.
