Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit

From: H. Peter Anvin
Date: Thu Oct 04 2012 - 17:52:57 EST


On 10/04/2012 06:56 AM, Konrad Rzeszutek Wilk wrote:

What Peter had in mind is a nice system where we get rid of
this linear allocation of page-tables (so pgt_buf_start -> pgt_buf_end
are linearly allocated). His thinking (and Peter if I mess
up please correct me), is that we can stick the various pagetables
in different spots in memory. Mainly that as we look at mapping
a region (say 0GB->1GB), we look at it in chunks (2MB?) and allocate
a page-table at the _end_ of the newly mapped chunk if we have
filled all entries in said pagetable.

For simplicity, let's say we are just dealing with PTE tables and
we are mapping the region 0GB->1GB with 4KB pages.

First we stick a page-table (or, if an existing one is found, reuse it)
at the start of the region (so 0-2MB).

0MB.......................2MB
/-----\
|PTE_A|
\-----/

The PTE entries in it will cover 0->2MB (PTE table #A) and once it is
finished, it will stick a new pagetable at the end of the 2MB region:

0MB.......................2MB...........................4MB
/-----\                   /-----\
|PTE_A|                   |PTE_B|
\-----/                   \-----/


The PTE_B page table will be used to map 2MB->4MB.

Once that is finished .. we repeat the cycle.

That should remove the utter duct-tape madness and make this a lot
easier.


You got the basic idea right but the details slightly wrong. Let me try to explain.

When we start up, we know we have a set of page tables which maps the kernel text, data, bss and brk. This is set up by the startup code on native and by the domain builder on Xen.

We can reserve an arbitrary chunk of brk that is (a) big enough to map the kernel text+data+bss+brk itself plus (b) some arbitrary additional chunk of memory (perhaps we reserve another 256K of brk or so, enough to map 128 MB in the worst case of 4K PAE pages.)
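As a rough check of the 256K figure, here is a small illustrative calculation (not kernel code; the function and constant names are made up for this example). It assumes 8-byte page-table entries, as with PAE, and counts only the last-level PTE tables, ignoring the upper levels:

```python
PAGE_SIZE = 4096      # 4K pages
PTE_SIZE_PAE = 8      # assumed: 8 bytes per page-table entry with PAE


def mappable_bytes(reserved_bytes, page_size=PAGE_SIZE, entry_size=PTE_SIZE_PAE):
    """Memory that `reserved_bytes` worth of PTE tables can map with 4K pages.

    Only last-level (PTE) tables are counted; upper-level tables are ignored.
    """
    tables = reserved_bytes // page_size          # one PTE table occupies one page
    entries_per_table = page_size // entry_size   # 512 entries with 8-byte PTEs
    return tables * entries_per_table * page_size


if __name__ == "__main__":
    print(mappable_bytes(256 * 1024) // (1024 * 1024), "MB")  # prints "128 MB"
```

That is, 256K holds 64 PTE tables, each mapping 2 MB, for 128 MB in total.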

Step 1:

- Create page table mappings for kernel text+data+bss+brk out of the
brk region.

Step 2:

- Start creating mappings for the topmost memory region downward, until
the brk reserved area is exhausted.

Step 3:

- Call a paravirt hook on the page tables created so far. On native
this does nothing; on Xen it can map them read-only and tell the
hypervisor they are page tables.

Step 4:

- Switch to the newly created page table. The bootup page table is now
obsolete.

Step 5:

- Moving downward from the last address mapped, create new page tables
for any additional unmapped memory region until either we run out of
unmapped memory regions, or we run out of already-mapped memory in
which to place the page tables for the regions still to be mapped.

Step 6:

- Call the paravirt hook for the new page tables, then add them to the
page table tree.

Step 7:

- Repeat from step 5 until there are no more unmapped memory regions.
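The loop in steps 2 through 7 can be sketched as a toy simulation. This is only an illustration under simplifying assumptions, not the kernel implementation: memory is one contiguous range split into 2 MB chunks, each chunk needs exactly one 4K PTE table, upper-level tables and the step 1 kernel mapping are ignored, and memory mapped during a pass only becomes usable for new tables after that pass ends (standing in for the paravirt hook and the tables being added to the tree). All names are hypothetical.

```python
CHUNK = 2 * 1024 * 1024            # memory mapped by one PTE table (4K pages, 8-byte PTEs)
PAGE = 4096                        # size of one page table
TABLES_PER_CHUNK = CHUNK // PAGE   # 512 table pages fit in one mapped chunk


def map_top_down(total_chunks, brk_tables):
    """Simulate the passes; return a list of (table_source, [chunks mapped])."""
    if brk_tables < 1:
        raise ValueError("need at least one brk-reserved table to get started")

    passes = []
    next_chunk = total_chunks - 1  # start at the topmost chunk and move downward

    # Steps 2-4: build tables out of the brk reservation until it is exhausted.
    first = []
    while next_chunk >= 0 and len(first) < brk_tables:
        first.append(next_chunk)
        next_chunk -= 1
    passes.append(("brk", first))

    # Steps 5-7: tables now live in memory that earlier passes mapped.
    free_pages = len(first) * TABLES_PER_CHUNK
    while next_chunk >= 0:
        batch = []
        while next_chunk >= 0 and free_pages > 0:
            free_pages -= 1        # one mapped page is consumed by the new table
            batch.append(next_chunk)
            next_chunk -= 1
        # Step 6 stand-in: only now does the newly mapped memory become usable.
        free_pages += len(batch) * TABLES_PER_CHUNK
        passes.append(("mapped memory", batch))
    return passes


if __name__ == "__main__":
    for source, chunks in map_top_down(total_chunks=2048, brk_tables=1):
        print(source, len(chunks))
```

With 4 GB of memory (2048 chunks) and a single brk-reserved table, the simulated passes map 1, 512 and 1535 chunks: each pass can map far more memory than the tables it consumes, so the loop converges in very few iterations without any up-front size estimate.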


This:

a) removes any need to guesstimate how much memory the page tables are
going to consume. We simply construct them; they may not be contiguous
but that's okay.

b) very cleanly solves the Xen problem of not wanting to status-flip
pages any more than necessary.


The only reason for moving downward rather than upward is that we want the page tables as high as possible in memory, since memory at low addresses is precious (for stupid DMA devices, for things like kexec/kdump, and so on.)

-hpa





--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
