Re: Lazy FPU gone from x86 platform --> probably not worth it anyhow

Adam J. Richter (adam@yggdrasil.com)
Wed, 29 Jul 1998 05:27:00 -0700

Next message: MOLNAR Ingo: "Re: Corruption Stats (fwd)"
Previous message: David S. Miller: "Re: Corruption Stats (fwd)"
In reply to: Sid Boyce: "Re: VESA VGA Frame Buffer"

I wrote:
> I see that the lazy FPU save and restore optimization have
>been removed from the latest 2.1 kernel. The comment in
>arch/i386/kernel/process.c says "Lazy FP saving no longer makes any
>sense with modern CPU's, and this simplifies a lot of things (SMP
[...]
>So, I am
>rather curious to know if lazy FPU save and restore is worthwhile or not,
>or if the truth is really not known.

In case anybody is wondering how this turned out, I unpacked
an old 2.1.95 kernel and built it with and without the lazy FPU
optimizations. I then ran a benchmark that passes a character from a
parent process to child and back by reading and writing through pipes,
and can optionally make either process execute a floating point
instruction on each iteration. This is about the most favorable
possible benchmark for lazy FPU save and restore.

On a single 300MHz Pentium II, I got the following results by
running this loop 5,000,000 times (to avoid floating point overflow, I
did this by running the loop 500,000 times per invocation of the
program, and invoking the program ten times, and then summing the time
figures). All times are given in seconds. These are the resources used
by the parent process.

                              user     system  (user+system)  elapsed time
noopt/2.1.95--no-fp       6.300000  42.160000    (48.460000)     97.170000
noopt/2.1.95--child-fp    6.540000  39.230000    (45.770000)     91.190000
noopt/2.1.95--parent-fp   8.770000  46.220000    (54.990000)    100.850000
opt/2.1.95--no-fp         6.100000  39.560000    (45.660000)     92.150000
opt/2.1.95--child-fp      6.630000  40.860000    (47.490000)     94.750000
opt/2.1.95--parent-fp     6.790000  40.960000    (47.750000)     94.950000

Savings from opt (with lazy FPU optimizations) versus noopt (no lazy FPU).
All times are listed in seconds; the percentage rows give the saving
relative to the noopt time.

                  user      system (user+system)  elapsed time
no fp:        0.200000    2.600000    (2.800000)      5.020000
no fp:       3.174603%   6.166983%   (5.777961%)     5.166204%  (% improvement)
child fp:    -0.090000   -1.630000   (-1.720000)     -3.560000
child fp:   -1.376147%  -4.154983%  (-3.757920%)    -3.903937%  (% improvement)
parent fp:    1.980000    5.260000    (7.240000)      5.900000
parent fp:  22.576967%  11.380355%  (13.166030%)     5.850273%  (% improvement)

The best improvement occurred when the parent process (the one
being measured) executed a floating point instruction on each loop.
The lazy FPU save and restore optimization here saved 7.24 seconds
across 5,000,000 iterations, or 1.448 microseconds (1448 nanoseconds,
434 CPU cycles) per context switch by the parent process, or a 13%
improvement.

It is probably fairer to sum the results of child-fp and
parent-fp to get a better estimate of the number of cycles saved
on the system as a whole, especially since the child-fp results show
negative savings, which probably means that the accounting of lazy FPU
save or restore costs is being moved to the non-floating point process.

child fp user+system savings                 -1.72 seconds
parent fp user+system savings                 7.24 seconds
                                            ------
                                              5.52 seconds saved in 10,000,000 iterations
                                              = 0.552 microseconds (552ns) / iteration
                                              = 165 CPU cycles

optimized child fp user+system total time    47.49 seconds
optimized parent fp user+system total time   47.75 seconds
                                            ------
                                             95.24 seconds

5.52 seconds saved / 95.24 seconds = 5.8% performance improvement

Admittedly, this is an extremely contrived benchmark. It
basically represents the maximum possible performance improvement
available from lazy FPU save and restore on a 300MHz Pentium II.

I believe that most floating point programs do not cause
context switches very often. So, I imagine that they run for the full
10,000 microsecond (1/100th of a second) time slice, of which this
optimization could save 0.552 microseconds, or roughly 1/20,000. So, I
suspect that there is not much of a throughput argument for this
optimization for many real world applications. Please correct me if I
am wrong; that is part of why I am posting this.

On the other hand, there is something to be said for enabling
the fastest possible context switches, and this looks like it can shave
5% off of the upper bound for that operation.

Anyhow, my inclination at this point is not to bother updating
the SMP lazy FPU save and restore patch unless anyone thinks it
really would be worth having.

P.S. For completeness, I have attached a copy of the benchmark program
below.

Adam J. Richter __ ______________ 4880 Stevens Creek Blvd, Suite 205
adam@yggdrasil.com \ / San Jose, California 95129-1034
+1 408 261-6630 | g g d r a s i l United States of America
fax +1 408 261-6631 "Free Software For The Rest Of Us."
--------------------------------CUT HERE---------------------------------
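#include <stdio.h>
#include <stdlib.h>	/* exit, atoi */
#include <string.h>	/* strcmp */
#include <unistd.h>	/* pipe, fork, read, write */
#include <sys/types.h>
#include <signal.h>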

int pipe1[2];
int pipe2[2];
int *my_input, *my_output;
int parent_fp = 0;
int child_fp = 0;

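/* Wait for one byte from the peer, then hand it back; each call forces
   a switch to the other process. */
static void
context_switch(void) {
    char buf;

    if (read(*my_input, &buf, 1) != 1) {
        perror("read");
        exit(1);
    }
    if (write(*my_output, &buf, 1) != 1) {
        perror("write");
        exit(1);
    }
}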

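void
child_process(void) {
    float tmp;

    /* tmp is only touched when child_fp is set, so that a non-fp
       child does no floating point work. */
    if (child_fp) tmp = 1.0;
    my_input = &pipe1[0];
    my_output = &pipe2[1];
    printf("child_fp = %d.\n", child_fp);
    fflush(stdout);
    for (;;) {
        if (child_fp) {
            tmp = tmp * 1.00001;
        }
        context_switch();
    }
    /* Not reached: the parent kills the child with SIGKILL. */
    if (child_fp) printf("child tmp = %f.\n", tmp);
    fflush(stdout);
}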

int
parent_process(int child, int count) {
    float tmp;

    /* tmp is only touched when parent_fp is set. */
    if (parent_fp) tmp = 1.0;
    my_input = &pipe2[0];
    my_output = &pipe1[1];

    printf("parent_fp = %d.\n", parent_fp);
    fflush(stdout);
    if (write(*my_output, "x", 1) < 0) {	/* prevent deadlock */
        perror("initial write");
        exit(1);
    }

    while (count-- > 0) {
        if (parent_fp) {
            tmp = tmp * 1.00001;
        }
        context_switch();
    }
    if (parent_fp) printf("parent tmp = %f\n", tmp);
    kill(child, SIGKILL);
    return 0;
}

int
main(int argc, char **argv) {
    int count;
    pid_t pid;

    if (argc < 2) {
        fprintf(stderr, "Usage: benchmark [--child-fp] [--parent-fp] iteration_count\n");
        exit(1);
    }
    /* pipe() takes the int[2] array itself, not its address. */
    if (pipe(pipe1) < 0 || pipe(pipe2) < 0) {
        perror("pipe");
        return 1;
    }
    /* The flags must be given in this order. argc is rechecked before
       each comparison so that argv[1] is never NULL when it is used. */
    if (argc > 1 && strcmp(argv[1], "--child-fp") == 0) {
        argc--;
        argv++;
        child_fp = 1;
    }
    if (argc > 1 && strcmp(argv[1], "--parent-fp") == 0) {
        argc--;
        argv++;
        parent_fp = 1;
    }
    /* Default to 10000 iterations if no count is left on the command line. */
    count = argc == 2 ? atoi(argv[1]) : 10000;
    switch (pid = fork()) {
    case 0:
        child_process();
        exit(0);
        /*NOTREACHED*/
    case -1:
        perror("fork");
        exit(1);
        /*NOTREACHED*/
    default:
        return parent_process(pid, count);
    }
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.altern.org/andrebalsa/doc/lkml-faq.html
