Re: "x86-64, mm: Put early page table high" causes crash on Xen

From: Stefano Stabellini
Date: Tue Mar 08 2011 - 07:14:57 EST


On Mon, 7 Mar 2011, Yinghai Lu wrote:
> On Mon, Mar 7, 2011 at 7:47 AM, Stefano Stabellini
> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
> > On Tue, 1 Mar 2011, Yinghai Lu wrote:
> >> On 03/01/2011 09:21 AM, Stefano Stabellini wrote:
> >> > Yinghai Lu,
> >> > while testing tip/x86/mm on Xen I found out that the commit "x86-64, mm:
> >> > Put early page table high" reliably crashes Linux at boot.
> >> > The reason is well explained by the commit message of
> >> > fef5ba797991f9335bcfc295942b684f9bf613a1:
> >> >
> >> > "Xen requires that all pages containing pagetable entries to be mapped
> >> > read-only. ÂIf pages used for the initial pagetable are already mapped
> >> > then we can change the mapping to RO. ÂHowever, if they are initially
> >> > unmapped, we need to make sure that when they are later mapped, they
> >> > are also mapped RO.
> >> >
> >> > We do this by knowing that the kernel pagetable memory is pre-allocated
> >> > in the range e820_table_start - e820_table_end, so any pfn within this
> >> > range should be mapped read-only. ÂHowever, the pagetable setup code
> >> > (Those mappings are removed before the pages are presented to Xen
> >> > as pagetable pages.)"
> >> >
> >> > In other words mask_rw_pte (called by xen_set_pte_init) should mark RO
> >> > the already existing pagetable pages (like the ones belonging to the
> >> > initial mappings), while it should mark RW the new pages not yet hooked
> >> > into the pagetable. ÂThis is what the following lines used to achieve,
> >> > but don't anymore:
> >> >
> >> > Â Â /*
> >> > Â Â Â* If the new pfn is within the range of the newly allocated
> >> > Â Â Â* kernel pagetable, and it isn't being mapped into an
> >> > Â Â Â*/
> >> > Â Â Â Â pfn >= pgt_buf_start && pfn < pgt_buf_end)
> >> > Â Â Â Â Â Â pte = pte_wrprotect(pte);
> >> >
> >> > Unfortunately now we map the already existing initial pagetable pages a
> >> > second time and the new zeroed pages using map_low_page, so we are
> >> > unable to distinguish between the two.
> >> >
> >> > Can we go back to the previous way of accessing pagetable pages from
> >> > kernel_physical_mapping_init, while keeping the new pagetable allocation
> >> > strategy? It seems to me that the introduction of map_low_page is not
> >> > actually required, is it? In that case we could just revert that bit...
> >> > (appended partial revert example).
> >>
> >> We do need map_low_page ( BTW, that name is totally misleading...)
> >>
> >> the reason is we put page_table high and at that time is not under max_pfn_mapped. (aka not mapped).
> >>
> >> So have to use
> >> Â Â Â adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
> >> to early map it and Read/Write to it.
> >
> > As you might already know from my other email few days ago, I have found
> > a solution to this problem modifying only Xen specific code.
> > However I went back to this because I want to be sure that the patch is
> > the correct solution rather than a workaround.
> >
> > It seems to me that only pagetable pages freshly allocated with
> > alloc_low_page needs to be mapped because initial pagetable pages are
> > under max_pfn_mapped.
> > Considering that kernel_physical_mapping_init doesn't go through the
> > same pagetable page twice, it is never the case that we map_low_page
> > anything but initial pagetable pages that are under max_pfn_mapped
> > anyway.
> > For example this is what happens on my machine:
> >
> >
> > [ Â Â0.000000] initial memory mapped : 0 - 024be000
> > [ Â Â0.000000] init_memory_mapping: 0000000000000000-00000000cffc2000
> > [ Â Â0.000000] Â0000000000 - 00cffc2000 page 4k
> > [ Â Â0.000000] kernel direct mapping tables up to cffc2000 @ cf93d000-cffc2000
> > [ Â Â0.000000] DEBUG map low page(2004000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2008000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c62000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c65000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c66000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c67000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c68000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c69000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c6a000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c6b000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c6c000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c6d000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c6e000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c6f000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c70000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c71000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c72000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c73000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c74000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c75000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c76000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c77000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c78000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c79000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c7a000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(2c7b000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(239a000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(239b000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(239c000, 00001000)
> > [ Â Â0.000000] DEBUG map low page(239d000, 00001000)
>
> these already mapped.

Right. So why are we mapping them again using map_low_page?
Maybe we can just access them directly like we did before?
In my tests going back to using phys_pud_update does not cause any
regressions.


> > [ Â Â0.000000] DEBUG alloc low page(cf93d000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf93e000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf93f000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf940000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf941000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf942000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf943000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf944000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf945000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf946000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf947000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf948000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf949000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf94a000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf94b000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf94c000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf94d000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf94e000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf94f000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf950000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf951000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf952000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf953000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf954000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf955000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf956000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf957000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf958000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf959000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf95a000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf95b000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf95c000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf95d000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf95e000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf95f000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf960000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf961000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf962000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf963000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf964000, 00001000)
> > [ Â Â0.000000] DEBUG alloc low page(cf965000, 00001000)
>
> these are not mapped yet, and need to use early_ioremap.

In fact we are mapping them inside alloc_low_page. No need for
map_low_page, right?


> > if this is the case the introduction of map_low_page is unnecessary and
> > actually makes the kernel map pages that are already mapped anyway.
> > Am I correct?
>
> initial mapping only to 512M, and new page for page tables could above that.

Sure, but the new pages are mapped from within alloc_low_page, so it
looks like the call to map_low_page is not useful and we could just go
back to access already existing pagetable pages directly.
For example:


@@ -554,6 +554,17 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
return last_map_addr;
}

+static unsigned long __meminit
+phys_pud_update(pgd_t *pgd, unsigned long addr, unsigned long end,
+ unsigned long page_size_mask)
+{
+ pud_t *pud;
+
+ pud = (pud_t *)pgd_page_vaddr(*pgd);
+
+ return phys_pud_init(pud, addr, end, page_size_mask);
+}
+
unsigned long __meminit
kernel_physical_mapping_init(unsigned long start,
unsigned long end,
@@ -577,10 +588,8 @@ kernel_physical_mapping_init(unsigned long start,
next = end;

if (pgd_val(*pgd)) {
- pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
- last_map_addr = phys_pud_init(pud, __pa(start),
+ last_map_addr = phys_pud_update(pgd, __pa(start),
__pa(end), page_size_mask);
- unmap_low_page(pud);
continue;
}