Re: signal(SIGFPE,SIG_IGN), possible solution?

Linus Torvalds (torvalds@cs.helsinki.fi)
Fri, 26 Apr 1996 11:58:46 +0300 (EET DST)


On Thu, 25 Apr 1996, Ben Wing wrote:
>
> Linus (who has obviously gotten a bit frazzled from working so hard
> on the kernel lately) sees fit to rant:

guilty as charged..

> You've sent this whole long flame, but completely missed the point
> that I was *NOT* talking about division by zero, but rather about
> overflow -- e.g. if I divide 0xFFFFFFFFFFFFFFFF (a 64-bit number)
> by 0xFFFFFFF (a 28-bit number) using the idivl instruction, the
> processor issues an exception because the result does not fit into
> 32 bit, but takes 36 bits. Returning MAX_INT here is NOT random --
> it's the closest reasonable approximation.

You don't see the problem, do you?

I did notice that you were talking about overflow rather than division
by zero, but that doesn't really change anything. It's the same thing:
do we accept software that gives unpredictable results or not?
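
For reference, the overflow condition under discussion can be sketched in plain C. This assumes the unsigned `divl` form (the signed `idivl` has different bounds), and the helper names `divl_would_fault` and `wide_quotient` are hypothetical, not part of any library. For unsigned operands the quotient fits in 32 bits exactly when the high half of the dividend is smaller than the divisor:

```c
#include <stdint.h>

/* Hypothetical helper: would an unsigned `divl divisor` fault when the
 * high 32 bits of the 64-bit dividend are `hi`?  It faults on a zero
 * divisor, or when the quotient needs more than 32 bits, which for
 * unsigned operands is exactly when hi >= divisor. */
int divl_would_fault(uint32_t hi, uint32_t divisor)
{
    return divisor == 0 || hi >= divisor;
}

/* Full-width quotient computed in 64-bit arithmetic, for comparison.
 * The caller must pass a non-zero divisor. */
uint64_t wide_quotient(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}
```

With the operands from the quoted example, `divl_would_fault(0xFFFFFFFF, 0xFFFFFFF)` reports a fault, and `wide_quotient` shows that the true quotient is well beyond what a 32-bit register can hold.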

For _you_ it is acceptable to get a rounding error.

For somebody else it may not be. And the kernel has to assume the
worst, for the simple reason that it can't _tell_ what the program
wants. I have yet to add the ESP-driver to the kernel to read the mind
of the user..

> And yes, I had considered longjmp()ing out of a signal handler.
> However, this idivl instruction occurs HUNDREDS, maybe thousands,
> of times in the renderer. Can you imagine the pain involved in
> ensuring there was a setjmp() everywhere?

You don't need to do that. In fact, you don't _want_ to do that, for
the same reason you don't want to check the arguments to the division.

Now, you're obviously using inline assembly or something, as I don't
think gcc will ever compile any division to do the 64/32->32 thing. If
you're ready to do that kind of thing, then you must be ready to play
around a bit in a signal handler or play with longjmp().

In a signal handler, you could even do

#include <asm/sigcontext.h>

void sigfpe_handler(int signr, struct sigcontext context)
{
	fixup(&context);
}

Where the "fixup()" routine does a disassembly of the %eip that
faulted, and jumps over it. (Depending on how you have written your
inline macro, you may know that the division is always done with a
register, so then the "fixup" can be as simple as just doing

	context.eip += 2;
	context.eax = ~0;
	context.edx = ~0;

and that's it..
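
The "+= 2" works because a 32-bit `div` or `idiv` with a register operand encodes as two bytes: opcode 0xF7 followed by a ModRM byte. As a rough sketch of the check a more cautious fixup() might make before skipping the instruction, the hypothetical helper below (the name `div_reg_insn_length` is an assumption, not an existing API) recognises only that unprefixed register form and reports 0 for anything else:

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of an unprefixed 32-bit
 * `div`/`idiv` with a register operand at `insn`, or 0 if the bytes
 * are anything else.  The caller must supply at least two readable
 * bytes. */
size_t div_reg_insn_length(const unsigned char *insn)
{
    unsigned mod, reg;

    if (insn[0] != 0xF7)            /* opcode group 3, 32-bit operand */
        return 0;

    mod = insn[1] >> 6;             /* ModRM bits 7..6 */
    reg = (insn[1] >> 3) & 7;       /* ModRM bits 5..3 select the op */

    if (mod != 3)                   /* memory operand: needs a real decoder */
        return 0;
    if (reg != 6 && reg != 7)       /* /6 is div, /7 is idiv */
        return 0;

    return 2;                       /* opcode byte plus ModRM byte */
}
```

A handler could advance the saved instruction pointer by the returned length and treat 0 as "not a fault I know how to skip".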

Notice that by NOT doing it in the kernel, you win, because
- your program can know when the overflow occurs, and in some cases
  that's important so that it can mark a certain pixel as having
  overflowed.
- the kernel doesn't have to worry about what the user wants to do
  this time.
- because you know what you're doing, you don't need to do extra work
  like the kernel would have had to (you can skip the disassembly, for
  example).
- you have a chance in hell of porting your program to some other
  platform in the future (the linux signal stack bears a remarkable
  similarity to the IBCS2 standard x86 unix signal stack)
- you can round the result any way _you_ want to. In fact, you can
  play games to make it very easy for yourself by having a fixed
  sequence of instructions to handle it.

And because you handle it yourself, you have any flexibility you want
with error handling. For example, you can make the fault handler jump
to some specified point in your function with something like this:

	__label__ error_handler;
	__asm__("divl %2"
		:"=a" (low), "=d" (high)
		:"g" (divisor), "c" (&&error_handler));
	... do normal cases ...

error_handler:
	... check against zero division or overflow, do whatever you want to ..

Then, your handler for SIGFPE needs only to do something like

	context.eip = context.ecx;

and there is no overhead at all for taking a fault and _knowing_ about
it for the normal case when you don't fault (well, the __asm__
statement sets up %ecx to point to the fault handler, but that's one
instruction and one register, so it may well be worth it for you).

In short, you've been barking up the wrong tree all the time. Instead
of trying to ignore SIGFPE which is arguably totally idiotic, you
should _handle_ them. If your application is speed critical, then the
handling might be something like the above (which is certainly not
pretty, but at least it's _clever_).

And if it isn't _that_ speed critical, then you can do it portably
right, and use siglongjmp().
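
A minimal sketch of the siglongjmp() route, assuming a POSIX system. All names here (`run_guarded`, `install_fpe_handler`, `divide_job`, `simulate_fault`) are hypothetical. Whether an integer division actually raises SIGFPE depends on the CPU, so `simulate_fault` uses raise() to stand in for a faulting instruction:

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf fpe_recover;              /* where the handler jumps back to */
static volatile sig_atomic_t fpe_armed;     /* set only while a guard is active */

static void on_sigfpe(int signo)
{
    (void)signo;
    if (fpe_armed)
        siglongjmp(fpe_recover, 1);         /* unwind to run_guarded() */
    signal(SIGFPE, SIG_DFL);                /* no guard: die the default way */
    raise(SIGFPE);
}

int install_fpe_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, NULL);
}

/* Runs fn(arg); returns 0 normally, -1 if SIGFPE arrived while it ran. */
int run_guarded(void (*fn)(void *), void *arg)
{
    /* savemask = 1, so SIGFPE is unblocked again after the jump */
    if (sigsetjmp(fpe_recover, 1) != 0) {
        fpe_armed = 0;
        return -1;
    }
    fpe_armed = 1;
    fn(arg);
    fpe_armed = 0;
    return 0;
}

struct div_job { int num, den, quot; };

/* Example callback: an ordinary division with a non-zero divisor. */
void divide_job(void *arg)
{
    struct div_job *job = arg;
    job->quot = job->num / job->den;
}

/* Stand-in for a faulting instruction, so the sketch runs anywhere. */
void simulate_fault(void *arg)
{
    (void)arg;
    raise(SIGFPE);
}
```

There is one guard per call to `run_guarded`, not one per division, so a renderer could wrap a whole scanline or tile and decide at that level what an overflow means.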

Notice? By doing it in the user process, you have the _choice_, and
you can do it right. If Linux did it in the kernel, you could never do
it right for everybody..

> Do I REALLY have to set up my own signal handler that looked at
> the assembly and stepped the program counter over the instruction?
> Isn't that more than a bit absurd?

"more than a bit absurd"?

That's _exactly_ what you asked _me_ to do in the kernel.. How does it
feel to have the tables turned on you?

You essentially asked me to do the same "absurd" thing, but for no
real reason, and from the kernel which is unpageable and where every
little piece of memory _stays_ in memory even though 99.95% of all
programs don't care or even _want_ this functionality..

THAT is why I think people have no grip on reality on this thing.

> Is it braindamage if I expect that if I say 'signal (SIGFPE, SIG_IGN)'
> then the machine will ignore the SIGFPE and continue in its merry
> way? If the only way to get this braindamage is to **explicitly
> request it**, how can it possibly make all other programs unsafe?

It's braindead, because you're confusing the act of handling a signal
with the act of _generating_ one.

When your program does a "signal(SIGFPE, SIG_IGN)", that means that it
will ignore any signals sent to it, and the kernel honours that.

That does not mean that the kernel should stop generating them (or try
to make the hardware stop generating them). You told the symptoms to
go away, but you didn't fix the problem - why do you expect the
problem to go away?

Oh, btw, this discussion _has_ resulted in something. As of 1.3.96,
the kernel will totally ignore and override any signal blocking and/or
SIG_IGN for errors that it can't (or won't) recover from.

"My name is Linus Torvalds, you messed with my kernel, prepare to die"

Linus