Re: Intel P6 vs P7 system call performance

From: Rogier Wolff (R.E.Wolff@BitWizard.nl)
Date: Wed Dec 18 2002 - 07:57:22 EST


On Tue, Dec 17, 2002 at 11:10:20AM -0800, Linus Torvalds wrote:
>
>
> On Tue, 17 Dec 2002, Ulrich Drepper wrote:
> >
> > But this is exactly what I expect to happen. If you want to implement
> > gettimeofday() at user-level you need to modify the page.
>
> Note that I really don't think we ever want to do the user-level
> gettimeofday(). The complexity just argues against it, it's better to try
> to make system calls be cheap enough that you really don't care.

I'd say that this should not be "fixed" from userspace, but from the
kernel. Thus if the kernel knows that the "gettimeofday" can be made
faster by doing it completely in userspace, then that system call
should be "patched" by the kernel to do it faster for everybody.

Next, someone might find a faster (full-userspace) way to do some
"reads"(*). Then it might pay to check for that specific
filedescriptor in userspace, and only call into the kernel for the
other filedescriptors. The idea is that the kernel knows best when
optimizations are possible.

Thus that ONE magic address is IMHO not the right way to do it. By
demultiplexing the stuff in userspace, you can do "sane" things with
specific syscalls.

So for example, the code at 0xffff80000 would be:
        mov 0x00,%eax
        int $80
        ret

(in the case where sysenter & friends is not available)

moving the "load syscall number into the register" into the
kernel-modifiable area does not cost a thing, but because we have
demultiplexed the code, we can easily replace the gettimeofday call by
something that (when it's easy) doesn't require the 600-cycle call
into kernel mode.

The "syscall _NR" would then become:

        call 0xffff8000 + _NR * 0x80

allowing for up to 0x80 bytes of "patchup code" or "do it quickly"
code, but also for a jump to some other "magic page", that has more
extensive code.

(Oh, I'm showing a base of 0xffff8000: A bit lower than previous
suggestions: allowing for a per-syscall entrypoint, and up to 0x80
bytes of fixup or "do it really quickly" code.)

P.S. People might argue that using this large "stride" would have a
larger cache-footprint. I think that all "where it matters" programs
will have a very small working-set of system calls. It might pay to
use a stride of say 0xa0 to spread the different
system-call-code-points over different cache-lines whenever possible.

                Roger.

(*) I was trying to pick a particularly unlikely case, but I can even
see a case where this is useful. For example, a kernel might be
compiled with "high performance pipes", which would move most of the
pipe reads and writes into userspace, through a shared-memory window.

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an      * 
* excursion: The stable situation does not include humans. ***************
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon Dec 23 2002 - 22:00:19 EST