Re: Kernel virtual memory?

Mark Hemment (markhe@nextd.demon.co.uk)
Fri, 8 Aug 1997 17:16:58 +0100 (BST)


On Fri, 8 Aug 1997, Victor Yodaiken wrote:
> Kernel pages should, in general, stick in place.
> It's one thing to move per-process
> pages in a process that is only running on one processor. It is something
> else to move a kernel global structure, forcing all cpus to invalidate
> tlbs at a common sync point. So the problem is not how you change the
> physical page bits in the PTE, the problem is the consequence of
> changing those bits.
> I'm particularly concerned with implications for Real-Time, although there
> are obvious performance implications for non-rt as well. One of the
> big wins for monolithic kernel design over micro-kernel is that it is
> much easier to get a high tlb hit rate in the monolithic kernel.
> If kmalloced pages can move, however, each move forces a tlb flush
> + global spinlock.

I'm not quite sure of your terms here, so I'll waffle for a bit in the
hope of answering your point.
[Please flame my mistakes, I wouldn't want to mis-inform anyone -
thanks!]

The kernel maintains a one-to-one, linear mapping of virtual address to
physical address. pa=va-PAGE_OFFSET (on Intel, where PAGE_OFFSET is
0xC0000000).
This mapping is never broken for the kernel address space, so we cannot
move 'kernel-pages', but we can move 'user-pages'.
A kernel-page can be considered as a work-page used internally by the
kernel, such as for "struct files_struct". These pages are allocated
either directly from the page-allocator (get_free_pages() and friends), or
indirectly via the SLAB allocator (or, now, also via the SIMP).
A 'user-page' is a page which is (or can be) mapped into a task's
address-space. This might be a page read from a file (known as a named
[or inode] page), an anonymous-page, or Shm page. These pages
are known as pageable, as they can be removed from physical memory. This
removal may cause a write to a file, or swap. It may also cause updating
of PTEs which referred to the page [to tell the page-fault handler where to
find the page].
I say, "may cause updating of PTEs" because the page being removed might
not have any current mappings - that is, it is a cached-page.
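
To make the linear mapping above concrete, here is a minimal sketch in C.
It assumes the 32-bit Intel layout given above (PAGE_OFFSET of 0xC0000000).
The helper names kva_to_pa() and pa_to_kva() are made up for illustration
and are not the kernel's own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32-bit Intel layout from the text. */
#define PAGE_OFFSET 0xC0000000u

/* Hypothetical helper: kernel virtual address -> physical address. */
static inline uint32_t kva_to_pa(uint32_t va)
{
    assert(va >= PAGE_OFFSET);  /* only kernel addresses are linearly mapped */
    return va - PAGE_OFFSET;
}

/* Hypothetical helper: physical address -> kernel virtual address. */
static inline uint32_t pa_to_kva(uint32_t pa)
{
    assert(pa < 0x40000000u);   /* must fit in the 1GB above PAGE_OFFSET */
    return pa + PAGE_OFFSET;
}
```

Because the translation is pure arithmetic, a kernel-page's physical
location is fixed by its virtual address, which is why it cannot be moved
without breaking the mapping.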

When an allocation is made for a non-zero page-order [more than one
physically contiguous page], there might not be a large enough
memory hole [of unallocated pages] to satisfy the requirement. To make
the hole, there are two choices:
1) Destroy a page so the physical page can be coalesced to form
the required contiguous area. This destroying may involve a
write to file/swap, and updating of PTEs.
2) Move a page to form the required contiguous area.
This involves copying the contents to another [unallocated]
page, and updating the PTEs to refer to this replacement page.
(and updating any other linkage, as for the page-cache).
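
As a rough illustration of the two choices above, the sketch below models
physical memory as an array of frame states and picks the aligned block
that needs the fewest user-pages displaced (destroyed or moved). This is
not how the kernel does it. The frame_state enum and pick_hole() are
hypothetical names, and the "fewest displaced pages" policy is my own
assumption for the example.

```c
#include <assert.h>
#include <stddef.h>

enum frame_state { FRAME_FREE, FRAME_USER, FRAME_KERNEL };

/*
 * Scan naturally aligned blocks of (1 << order) frames and pick the one
 * that needs the fewest user-pages displaced. Blocks containing a
 * kernel-page are skipped, since kernel-pages cannot be moved.
 * Returns the first frame index of the chosen block, or -1 if none.
 * On success, *displace receives the number of user-pages in the block.
 */
static long pick_hole(const enum frame_state *frames, size_t nframes,
                      unsigned order, size_t *displace)
{
    size_t block = (size_t)1 << order;
    size_t best_cost = 0;
    long best = -1;

    assert(frames != NULL && displace != NULL);

    for (size_t start = 0; start + block <= nframes; start += block) {
        size_t cost = 0;
        int pinned = 0;

        for (size_t i = start; i < start + block; i++) {
            if (frames[i] == FRAME_KERNEL) {
                pinned = 1;         /* cannot be moved: give up on block */
                break;
            }
            if (frames[i] == FRAME_USER)
                cost++;             /* would need destroying or moving */
        }
        if (pinned)
            continue;
        if (best < 0 || cost < best_cost) {
            best = (long)start;
            best_cost = cost;
        }
    }
    if (best >= 0)
        *displace = best_cost;
    return best;
}
```

Whether each counted user-page is then destroyed (choice 1) or copied to a
free frame elsewhere (choice 2) is a separate decision. The sketch only
shows that kernel-pages rule a block out while user-pages merely add cost.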

Presently, only 1) is used. It destroys pages blindly in the hope that
it will release a required page.

Now, there are other pages in the kernel which can also be destroyed,
but _not_ moved (or at least it doesn't make any sense to move them).
A good example of these are the buffer-pages used in fs/buffer.c, another
example is pages allocated to the SLAB which contain no active objects.

NOTE: kmalloc()ed page(s) which contain active objects _cannot_ be destroyed!!!

[Of course, we also have page-tables which can be paged to swap.
This is not currently done, though, and has some nice races to take
care of.]

As you know, after updating a PTE the TLB needs to be flushed. On
Intel, it is possible to invalidate a single TLB entry - which is what is
done now. Note, the invalidation is needed if a page is destroyed _or_
moved.

For SMP, things are a bit more complicated. The TLBs on the engines
need to be kept consistent.

> For RT there are three issues. First, invalidating the tlb will hammer
> response time for a rt-task (I'm very curious about cyrix where
> you can lock in some cache lines). Second, what happens if a
> tlb invalidate irq is asserted during the execution of a rt-task?

Where pages are being destroyed or moved, the TLB needs to be
invalidated. The extra overheads in moving a page are:
1) The cycles used for the copy.
2) Cache pollution from the copy.

Where there are plenty of free pages, but not the necessary
"contiguously free pages", moving would probably be more efficient
over-all - it would avoid a possible dirty-write, and a possible
[later] page-in.
The PCD (page-level cache disable) bit could be used to avoid the
cache-pollution, but it might only be useful on the destination page where
the architecture write-allocates cache lines (Pentium Pros, I believe,
write-allocate.).

The flush_tlb_[page|range|mm] macros for SMP are intelligent, and try to
avoid cross-engine flushes. I guess it is possible [as a comment in
asm-i386/pgtable.h says] to be a bit more clever and avoid some flushes.
The smp_flush_tlb() does flush the entire TLB on the other engines,
rather than a selected entry. This makes the smp-message light (as it
carries no data).
I believe David S. Miller has some designs/ideas on reducing cross-engine
flushing (via the cpu_vm_mask). It's probably already in the Sparc port.
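
To show the kind of saving a per-mm CPU mask could give, here is a small
sketch. It assumes that bit n of cpu_vm_mask is set when CPU n may hold TLB
entries for the address space. That meaning, the mm_sketch struct, and both
function names are my assumptions for illustration, not the actual Sparc or
i386 code.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 32u

/* Hypothetical: bit n set means CPU n may hold TLB entries for this mm. */
struct mm_sketch {
    uint32_t cpu_vm_mask;
};

/* CPUs, other than the caller, that would need a cross-engine flush. */
static uint32_t remote_flush_targets(const struct mm_sketch *mm,
                                     unsigned this_cpu)
{
    assert(this_cpu < MAX_CPUS);
    return mm->cpu_vm_mask & ~((uint32_t)1 << this_cpu);
}

/* Non-zero only when some other CPU may hold stale entries. */
static int needs_cross_cpu_flush(const struct mm_sketch *mm,
                                 unsigned this_cpu)
{
    return remote_flush_targets(mm, this_cpu) != 0;
}
```

With a mask like this, an mm that has only ever run on the flushing CPU
needs a local flush and no smp-message at all.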

> The solution I want to implement would be for the invalidate irq
> handler to promise to flush and return to the rt-task. Then before
> we return to linux-proper, we would actually commit the flush. This makes
> it easy to make smp-invalidate "virtually" non-maskable.

Hmm, I guess you mean delayed shootdowns. SVR4 does this when managing
kernel virtual addresses (the segkmap segment). Or at least I think this
is where it uses them....
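
If I have understood the idea, a delayed shootdown might look roughly like
the sketch below: the invalidate handler only records a promise while an
rt-task is running, and the flush is committed on the way back to
Linux-proper. Everything here (the cpu_sketch struct, the function names,
the flush counter standing in for a real TLB flush) is hypothetical. It
also leans on the assumption quoted below, that pages used by the rt-task
are never moved.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for the sketch. */
struct cpu_sketch {
    bool in_rt;          /* an rt-task is currently running */
    bool flush_pending;  /* a flush was promised but not yet done */
    unsigned flushes;    /* stand-in for the real TLB flush: just counts */
};

static void do_flush(struct cpu_sketch *c)
{
    c->flushes++;
    c->flush_pending = false;
}

/* Called when another CPU asks this one to invalidate its TLB. */
static void on_invalidate_irq(struct cpu_sketch *c)
{
    if (c->in_rt)
        c->flush_pending = true;  /* promise only, return to the rt-task */
    else
        do_flush(c);
}

static void rt_enter(struct cpu_sketch *c)
{
    c->in_rt = true;
}

/* Commit any promised flush before returning to Linux-proper. */
static void rt_exit(struct cpu_sketch *c)
{
    c->in_rt = false;
    if (c->flush_pending)
        do_flush(c);
}
```

Several requests that arrive during one rt-task collapse into a single
flush at rt_exit(), which is part of the attraction.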

> All this assumes that pages used by the rt-task will never be moved
> by Linux-proper. If this assumption is not correct, then rt becomes
> impossible on Linux because a non-rt process on one cpu will
> effectively be able to stop a rt-task on a second cpu.

Most page allocations are for single pages. Under this condition, there
is no need to move pages. (All pageable pages are allocated as single
pages). There are a few single-page DMA allocations, which might cause a
page-move, but nothing serious.
"rt becomes impossible on Linux". Wow, strong statement!!
As I've said, there are not many allocations which need contiguous/DMA
pages. Not every flush_tlb_page() causes a call to smp_flush_tlb().
A smp_flush_tlb() is expensive, and to be avoided whenever possible, but
it is certainly not going to make RT impossible. In fact, better memory
management makes RT more do-able.

> But this is precisely the expensive and problematic synchronization I
> want to avoid.

The ideal solution, to this and many other problems, is to place the onus
on the exception - which should happen rarely. That is why I thought of
removing the reference bit, to force a page-fault. What can seem ugly
when written in words, can _sometimes_ be implemented cleanly. And it is
the cleanness which is important.

At the moment, I'm more concerned with the Fifth Ashes Test (important
cricket match) than Memory Management :)

Regards,

markhe

------------------------------------------------------------------
Mark Hemment, Unix/C Software Engineer (Contractor)
markhe@nextd.demon.co.uk http://www.nextd.demon.co.uk/
"Success has many fathers, failure is a B**TARD!" - anon
------------------------------------------------------------------