I agree that we want a simple and scalable ABI.
> Instead, we should just have different classes of system calls:
[...]
Maybe this is trying to abstract it too much? Putting different
classes of system calls into the ABI seems awkward to me.
I propose a much simpler abstraction: set up a global page (which
always appears at a fixed address in user-space), and set up a jump
table. Have one jump vector per system call. That's the ABI. End of
story.
On a dumb CPU, each vector points to a piece of code that implements
the standard syscall. On better CPUs, some syscall vectors will point
to code that uses a better syscall interface (like sysenter). And some
syscall vectors (e.g. gettimeofday(2)) will point to code that reads
some kernel data in the global page, without any switch to kernel
space.
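To illustrate the gettimeofday(2) case, here is a minimal user-space
sketch. The struct layout, the names (page_time, publish_time,
fast_gettimeofday) and the sequence-counter retry scheme are all my
assumptions for illustration, not something specified above. A static
variable stands in for the data the kernel would keep in the global page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the time data published in the global page. */
struct page_time {
    volatile uint32_t seq;   /* odd while an update is in progress */
    volatile int64_t  sec;
    volatile int64_t  usec;
};

/* Stand-in for the kernel data that would live in the global page. */
static struct page_time global_time;

/* "Kernel side": publish a new time value. A real implementation would
 * also need memory barriers; they are omitted in this sketch. */
static void publish_time(int64_t sec, int64_t usec)
{
    global_time.seq++;          /* becomes odd: update in progress */
    global_time.sec = sec;
    global_time.usec = usec;
    global_time.seq++;          /* becomes even: update complete */
}

/* "User side": read the time without entering the kernel. Retry if an
 * update was in progress or completed while we were reading. */
static void fast_gettimeofday(int64_t *sec, int64_t *usec)
{
    uint32_t start;

    assert(sec != NULL && usec != NULL);
    do {
        start = global_time.seq;
        *sec = global_time.sec;
        *usec = global_time.usec;
    } while ((start & 1u) || start != global_time.seq);
}
```

The vector for gettimeofday(2) would then point at something like
fast_gettimeofday() instead of at code that traps into the kernel.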
Now we can optimise syscalls on a case-by-case basis, rather than
trying to solve all the problems.
If you are concerned about multiple jumps, maybe each entry in the
table can implement the standard syscall interface inline, as long as
it doesn't take too many instructions.
All user-space has to know is that syscall N is made by making a
function call to BASE+k*N, where k is the size of the jump vectors.
And whether parameters are passed in registers or on the stack.
BASE and k are fixed for all time.
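The BASE+k*N addressing can be sketched in portable C as follows. This is
only a simulation under stated assumptions: the "global page" is an
ordinary array, each k-byte slot holds a function pointer rather than
real code, and the values of VEC_SIZE and NUM_VECS as well as all the
function names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define VEC_SIZE 16u   /* hypothetical k: bytes per jump vector */
#define NUM_VECS 2u    /* hypothetical number of syscalls */

enum { PATH_SLOW, PATH_FAST };

typedef long (*vec_fn)(long a, long b);

/* Stand-in for the global page; &global_page[0] plays the role of BASE. */
static unsigned char global_page[VEC_SIZE * NUM_VECS];
static int last_path;  /* records which entry path was taken */

/* Stand-in for the standard syscall entry on a dumb CPU. */
static long slow_entry(long a, long b) { last_path = PATH_SLOW; return a + b; }

/* Stand-in for a better entry mechanism (like sysenter). */
static long fast_entry(long a, long b) { last_path = PATH_FAST; return a + b; }

/* Store an entry point in slot N, i.e. at BASE + k*N. */
static void install_vector(unsigned n, vec_fn fn)
{
    assert(n < NUM_VECS && sizeof fn <= VEC_SIZE);
    memcpy(global_page + VEC_SIZE * n, &fn, sizeof fn);
}

/* "Kernel side": choose an implementation per syscall at setup time. */
static void install_vectors(int cpu_has_fast_entry)
{
    install_vector(0, slow_entry);
    install_vector(1, cpu_has_fast_entry ? fast_entry : slow_entry);
}

/* "User side": only BASE, k and N are needed to make syscall N. */
static long call_vector(unsigned n, long a, long b)
{
    vec_fn fn;

    assert(n < NUM_VECS);
    memcpy(&fn, global_page + VEC_SIZE * n, sizeof fn);
    return fn(a, b);
}
```

The caller never learns which mechanism sits behind a vector, so the
choice can differ per syscall and per CPU without changing user-space.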
> If people don't like the page mapping idea, then come up with a
> better way, but don't beat the dead horse of exposing SYSENTER.
I'm happy with the page mapping idea, but what concerns me is that we
can end up with a kernel which has a fair bit of code/data embedded in
it, due to the increasing number of syscall instructions. Even if it's
contained in __init sections, it still bloats the kernel image. This
is a particular problem with embedded systems. Config options will
help here, but we have too many of those already.
So I suggest a few possibilities for working around this:
1) the kernel reserves a page (or pages) which is mapped into each
process at the same VA, but is written to by user-space. Perhaps a
syscall/ioctl to write-protect the region once initialised.
2) have a single module which contains all the code/data variants and
writes the appropriate selection to the global page(s)
3) have a collection of modules, one for each CPU (implementation)
type, and user-space picks the correct one to load. The module
initialises the global page.
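The "write, then write-protect" step of option (1) can be approximated
with existing POSIX calls. This sketch uses mmap(2) and mprotect(2) on an
anonymous page as a stand-in; it does not model a kernel-reserved page at
a fixed VA, and the function name setup_page is hypothetical.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS with strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page, copy the prepared image into it, then make it read-only.
 * Returns the page address, or NULL on failure. A page holding real code
 * would also need PROT_EXEC; that is left out of this sketch. */
static void *setup_page(const void *image, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *page;

    assert(image != NULL);
    if (page_size <= 0 || len > (size_t)page_size)
        return NULL;

    page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return NULL;

    /* User-space initialises the page contents. */
    memcpy(page, image, len);

    /* Write-protect the region once initialised. */
    if (mprotect(page, (size_t)page_size, PROT_READ) != 0) {
        munmap(page, (size_t)page_size);
        return NULL;
    }
    return page;
}
```

Until such a page is set up, user-space still has to make syscalls the
standard way (mmap and mprotect themselves, in this sketch).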
The advantage of (1) is that it's simple (minimum kernel bloat). The
disadvantages are that it separates code from data (the
gettimeofday(2) case), which Linus doesn't like, and that it requires
some user-space code to use the standard syscall interface (since the
global page won't be initialised yet).
The advantage of (2) is that it keeps the kernel image small. The
disadvantages are that it doesn't separate code from data, it's still
a big module to carry around for embedded systems, and also some
user-space code has to use the standard syscall interface.
The advantage of (3) is that it keeps the kernel small and is also
friendly to embedded systems. It also suffers from the code/data
separation and some user-space code having to use the standard syscall
interface.
Regards,
Richard....
Permanent: rgooch@atnf.csiro.au
Current: rgooch@ras.ucalgary.ca