Re: Memory Management - BSD vs Linux

Linus Torvalds (torvalds@transmeta.com)
12 Aug 1997 17:04:13 GMT


In article <199708121554.QAA12320@sable.ox.ac.uk>,
Malcolm Beattie <mbeattie@sable.ox.ac.uk> wrote:
>Douglas Jardine writes:
>> I missed a couple of questions in my last mail:
>>
>> [7] In order to be able to run over different architectures, Linux
>> implements a 3-level page table. It then rolls the architecture
>> specific stuff into this 3-level organization. For example x86
>> has only 2-level page tables but these are appropriately munged
>> into the 3-level organization. My question is: are there
>> any architectures out there for which this sort of transformation
>> won't work? I.e., does the transformation take away enough from
>> the architecture's strengths that other hacks are needed to be
>> able to get reasonable performance?

There are no architectures for which this doesn't work.

The reasoning is very simple: the Linux kernel three-level page tables
have NOTHING to do conceptually with the page tables that the CPU
actually uses.

So what happens is that the kernel uses the simple three-level page
table approach, and then it depends on the architecture-specific MMU
layer to fill in the CPU-specific page tables.

The logic is that any CPU-specific page tables are just an extended TLB,
and are treated as such by the Linux kernel. When a page fault occurs,
the kernel will walk the three-level page table and fill in the "TLB"
(even if that TLB is then some strange architecture-specific page table
like the hash tables on rs6000).
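
To make the fault path concrete, here is a minimal sketch in C. It is
not kernel code: the toy_* names, the 2-bits-per-level layout and the
4-slot direct-mapped "TLB" are all assumptions made up for illustration.
It only shows the shape of the idea: the three-level table is the
authoritative map, and the fault handler copies one translation from it
into whatever cache the hardware consults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy layout: 4 KiB pages, 2 index bits per level. */
#define TOY_PAGE_SHIFT 12
#define TOY_LEVEL_BITS 2
#define TOY_ENTRIES    (1u << TOY_LEVEL_BITS)
#define TOY_TLB_SLOTS  4

typedef struct { uint32_t pfn; int present; } toy_pte;
typedef struct { toy_pte *ptes; } toy_pmd;  /* NULL if no pte table */
typedef struct { toy_pmd *pmds; } toy_pgd;  /* NULL if no pmd table */

/* Stand-in for the CPU-specific structure (TLB, hash table, ...). */
typedef struct { uint32_t vpn; uint32_t pfn; int valid; } toy_tlb_entry;
static toy_tlb_entry toy_tlb[TOY_TLB_SLOTS];

/* Walk pgd -> pmd -> pte; return NULL if the address is not mapped. */
static const toy_pte *toy_walk(const toy_pgd *pgd, uint32_t vaddr)
{
    uint32_t vpn   = vaddr >> TOY_PAGE_SHIFT;
    uint32_t i_pgd = (vpn >> (2 * TOY_LEVEL_BITS)) & (TOY_ENTRIES - 1);
    uint32_t i_pmd = (vpn >> TOY_LEVEL_BITS) & (TOY_ENTRIES - 1);
    uint32_t i_pte = vpn & (TOY_ENTRIES - 1);
    const toy_pmd *pmd;
    const toy_pte *pte;

    if (!pgd[i_pgd].pmds)
        return NULL;
    pmd = &pgd[i_pgd].pmds[i_pmd];
    if (!pmd->ptes)
        return NULL;
    pte = &pmd->ptes[i_pte];
    return pte->present ? pte : NULL;
}

/* "Page fault": refill the translation cache from the software table.
 * Returns 0 on success, -1 if the software table has no mapping. */
static int toy_handle_fault(const toy_pgd *pgd, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> TOY_PAGE_SHIFT;
    const toy_pte *pte = toy_walk(pgd, vaddr);
    toy_tlb_entry *slot;

    if (!pte)
        return -1;
    slot = &toy_tlb[vpn % TOY_TLB_SLOTS];  /* may evict an older entry */
    slot->vpn = vpn;
    slot->pfn = pte->pfn;
    slot->valid = 1;
    return 0;
}

/* What the "hardware" does: returns 1 and sets *paddr on a hit, 0 on a miss. */
static int toy_tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> TOY_PAGE_SHIFT;
    const toy_tlb_entry *slot = &toy_tlb[vpn % TOY_TLB_SLOTS];

    assert(paddr != NULL);
    if (!slot->valid || slot->vpn != vpn)
        return 0;
    *paddr = (slot->pfn << TOY_PAGE_SHIFT)
           | (vaddr & ((1u << TOY_PAGE_SHIFT) - 1));
    return 1;
}
```

Swapping toy_tlb for a hash table or any other CPU-defined structure
changes only toy_handle_fault and toy_tlb_lookup. The walk stays the same.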

Now, as a performance enhancement, the kernel allows the
architecture-specific definitions of the kernel three-level page table
to be modified so that the kernel page table looks more like what the
CPU wants to have for its TLB refill. The fact that this mapping is so
good that in many cases we can just directly re-use the kernel page
tables as the hardware page tables is obviously very clever of me, but
it doesn't change the basic fact that _conceptually_ they are two
different things.
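
As an illustration of how a three-level interface can sit directly on
top of a two-level hardware table, here is a hedged sketch in C. The
fold_* names and types are hypothetical, and the 10/10/12 bit split is
assumed to mimic a two-level x86-style layout. It is not the kernel's
actual header. The trick shown is that the middle level is "folded":
each top-level entry contains exactly one middle-level entry, so the
middle step of the walk costs nothing and the in-memory layout is still
a plain two-level table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed two-level hardware layout: 10 + 10 index bits, 4 KiB pages. */
#define FOLD_PAGE_SHIFT 12
#define FOLD_PGD_SHIFT  22
#define FOLD_PTRS       1024

typedef struct { uint32_t frame; int present; } fold_pte;
typedef struct { fold_pte *ptes; } fold_pmd;  /* middle-level entry */
typedef struct { fold_pmd pmd; } fold_pgd;    /* wraps exactly one pmd */

/* Top level: index with the high 10 bits of the address. */
static fold_pgd *fold_pgd_offset(fold_pgd *dir, uint32_t va)
{
    return &dir[va >> FOLD_PGD_SHIFT];
}

/* Folded middle level: no indexing, just reinterpret the pgd entry. */
static fold_pmd *fold_pmd_offset(fold_pgd *pgd, uint32_t va)
{
    (void)va;
    return &pgd->pmd;
}

/* Bottom level: index with the next 10 bits. */
static fold_pte *fold_pte_offset(fold_pmd *pmd, uint32_t va)
{
    return &pmd->ptes[(va >> FOLD_PAGE_SHIFT) & (FOLD_PTRS - 1)];
}

/* Generic code always performs three steps, whatever the hardware has.
 * Returns NULL if no pte table is installed for this address. */
static fold_pte *fold_lookup(fold_pgd *dir, uint32_t va)
{
    fold_pgd *pgd;
    fold_pmd *pmd;

    assert(dir != NULL);
    pgd = fold_pgd_offset(dir, va);
    pmd = fold_pmd_offset(pgd, va);
    if (!pmd->ptes)
        return NULL;
    return fold_pte_offset(pmd, va);
}
```

On an architecture with a real middle level, only fold_pmd_offset and
the type definitions would change; fold_lookup would stay as written.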

>I wondered about that too. An interesting example would be the
>inverted page tables of the RS6000 (and isn't PA-RISC either the
>same or weird in a similar way: I know there's DVMA fun for PA-RISC).

The RS6000 doesn't have inverted page tables. It has a screwy hash
table that IBM calls an inverted page table, but it isn't. It's a simple
extension of the on-chip TLB, and should be treated as such. IBM is just
calling it something else to make it look better.

(Definition of page table: something that maps virtual addresses to
physical addresses. The IBM hash tables do NOT match that definition,
because they can only map _parts_ of the virtual address space. As such
they match the definition of a TLB: a "cache of virtual->physical
translation entries". I dare any IBM blockhead to try to refute this
without looking silly).
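
The distinction can be shown with a deliberately tiny sketch in C. This
is not the real RS6000 or PowerPC hash table format: the htab_* names,
the 8-slot table and the modulo hash are assumptions chosen to keep the
example short. The point is only that a fixed-size hashed structure can
lose a perfectly valid translation to a collision, so something else has
to hold the complete mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HTAB_SLOTS 8  /* hypothetical, far smaller than any real table */

typedef struct { uint32_t vpn; uint32_t pfn; int valid; } htab_entry;
static htab_entry htab[HTAB_SLOTS];

/* Toy hash: many virtual page numbers share each slot. */
static unsigned htab_hash(uint32_t vpn)
{
    return vpn % HTAB_SLOTS;
}

/* Insert a translation, evicting whatever occupied the slot before. */
static void htab_insert(uint32_t vpn, uint32_t pfn)
{
    htab_entry *e = &htab[htab_hash(vpn)];

    e->vpn = vpn;
    e->pfn = pfn;
    e->valid = 1;
}

/* Returns 1 and sets *pfn on a hit; returns 0 on a miss. A miss says
 * nothing about whether the page is mapped, exactly as with a TLB. */
static int htab_lookup(uint32_t vpn, uint32_t *pfn)
{
    const htab_entry *e = &htab[htab_hash(vpn)];

    assert(pfn != NULL);
    if (!e->valid || e->vpn != vpn)
        return 0;
    *pfn = e->pfn;
    return 1;
}
```

With this table, inserting page 1 and then page 9 leaves page 1
unreachable through htab_lookup even though its mapping is still valid,
which is cache behaviour rather than page table behaviour.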

>Since nobody has tried porting Linux to RS6k (that I know about) and
>the current PA-RISC support is on top of a microkernel (or is there
>something native now?), there's never been a case where the Linux
>pgd/pmd/pte hasn't been general enough (whereas SVR4 HAT and
>*BSD/Mach would be).

Linux is able to handle the RS6000 hash tables quite well, thank you
very much. They exist on the PowerPC too, and there is already a native
port to that.

In fact, the Linux page table setup is a lot MORE versatile and cleaner
than the SVR4 HAT. Compare the amount of code needed in Linux to handle
different VM architectures (SPARC, i386, Alpha, you name it) to the HAT
version. The Linux version is a lot smaller, and I'll bet you that it's
a lot faster too.

> Looked at another way: all the main architectures
>*are* covered by the Linux solution (PA-RISC is a "maybe" I suppose)
>so, since the Linux one gives enough generality to cover them and also
>gives performance benefits (less indirection and more opportunity for
>the compiler to optimise), it was a good design decision.

No. It was a good design decision, FULL STOP.

Yes, the Linux approach to page tables is faster than anybody else's. But
no, that speed is not because it sacrificed generality, it's because I'm
a better designer than all the other people that designed virtual memory
subsystems. Sorry, but that's just how it is.

I'm not only exceedingly clever, I'm also a conceited little bugger,
ain't I?

>(But I'd still like to know if anyone has a mapping worked out for
>RS6k :-).

Just fetch any of the later development kernels, and look into the ppc
directories (include/asm-ppc/pgtable.h, arch/ppc/kernel/head.S and
arch/ppc/mm/fault.c). Close your eyes and imagine that "ppc" really
says "rs6000" and you'll be home free.

And I'll bet that even on the RS6000 the Linux architecture-specific
support is just a small _fraction_ of what anybody else has to go
through.

Linus