Re: [PATCH] makedumpfile: request the kernel do page scans

From: HATAYAMA Daisuke
Date: Thu Dec 20 2012 - 20:36:04 EST

From: Cliff Wickman <cpw@xxxxxxx>
Subject: Re: [PATCH] makedumpfile: request the kernel do page scans
Date: Thu, 20 Dec 2012 09:51:47 -0600

> On Thu, Dec 20, 2012 at 12:22:14PM +0900, HATAYAMA Daisuke wrote:
>> From: Cliff Wickman <cpw@xxxxxxx>
>> Subject: Re: [PATCH] makedumpfile: request the kernel do page scans
>> Date: Mon, 10 Dec 2012 09:36:14 -0600
>> > On Mon, Dec 10, 2012 at 09:59:29AM +0900, HATAYAMA Daisuke wrote:
>> >> From: Cliff Wickman <cpw@xxxxxxx>
>> >> Subject: Re: [PATCH] makedumpfile: request the kernel do page scans
>> >> Date: Mon, 19 Nov 2012 12:07:10 -0600
>> >>
>> >> > On Fri, Nov 16, 2012 at 03:39:44PM -0500, Vivek Goyal wrote:
>> >> >> On Thu, Nov 15, 2012 at 04:52:40PM -0600, Cliff Wickman wrote:
>> >
>> > Hi Hatayama,
>> >
>> > If ioremap/iounmap is the bottleneck then perhaps you could do what
>> > my patch does: it consolidates all the ranges of physical addresses
>> > where the boot kernel's page structures reside (see make_kernel_mmap())
>> > and passes them to the kernel, which then does a handful of ioremaps to
>> > cover all of them. Then /proc/vmcore can look up the already-mapped
>> > virtual address.
>> > (Also note a kludge in get_mm_sparsemem() that verifies that each section
>> > of the mem_map spans contiguous ranges of page structures. I had
>> > trouble with some sections when I made that assumption.)
>> >
>> > I'm attaching 3 patches that might be useful in your testing:
>> > - 121210.proc_vmcore2 my current patch that applies to the released
>> > makedumpfile 1.5.1
>> > - 121207.vmcore_pagescans.sles applies to a 3.0.13 kernel
>> > - 121207.vmcore_pagescans.rhel applies to a 2.6.32 kernel
>> >
>> I used the same patch set on the benchmark.
>> BTW, I still have a machine reservation issue, so I think I cannot use
>> a terabyte-memory machine at least this year.
>> Also, your patch set does ioremap per chunk of memory map,
>> i.e. a number of consecutive pages at a time. On your terabyte
>> machines, how large are the chunks? We have a memory consumption issue
>> in the 2nd kernel, so we must decrease the amount of memory used. But
>> looking into the ioremap code quickly, it does not appear to use 2MB or
>> 1GB pages to remap. This means huge page tables are generated for
>> terabytes of memory. Or have you perhaps already investigated this?
>> BTW, I have two ideas to solve this issue:
>> 1) make a linear direct mapping for the old memory, and access the old
>> memory via the linear direct mapping, not by ioremap.
>> - adding remap code in vmcore, or passing the regions that need to
>> be remapped using the memmap= kernel option to tell the 2nd kernel to
>> map them in addition.
> Good point. It would take over 30G of memory to map 16TB with 4k pages.
> I recently tried to dump such a machine and ran out of kernel memory --
> no wonder!

One question. On a terabyte-memory machine, your patch set now always
runs out of kernel memory and panics when writing pages, right?
Only the scan of the mem_map array can complete.

> Do you have a patch for doing a linear direct mapping? Or can you name
> existing kernel infrastructure to do such mapping? I'm just looking for
> a jumpstart to enhance the patch.

I have a prototype patch only. See the patch at the end of this mail,
which tries to create a linear direct mapping using
init_memory_mapping(), which supports 2MB and 1GB pages. We can see
what kinds of pages are used from dmesg:

$ dmesg
initial memory mapped: [mem 0x00000000-0x1fffffff]
Base memory trampoline at [ffff880000094000] 94000 size 28672
Using GB pages for direct mapping
init_memory_mapping: [mem 0x00000000-0x7b00cfff] <-- here
[mem 0x00000000-0x3fffffff] page 1G
[mem 0x40000000-0x7affffff] page 2M
[mem 0x7b000000-0x7b00cfff] page 4k
kernel direct mapping tables up to 0x7b00cfff @ [mem 0x1fffd000-0x1fffffff]
init_memory_mapping: [mem 0x100000000-0x87fffffff] <-- here
[mem 0x100000000-0x87fffffff] page 1G
kernel direct mapping tables up to 0x87fffffff @ [mem 0x7b00c000-0x7b00cfff]
RAMDISK: [mem 0x37406000-0x37feffff]
Reserving 256MB of memory at 624MB for crashkernel (System RAM: 32687MB)

The source of the memory mapping information is the PT_LOAD entries of
/proc/vmcore, which are kept in vmcore_list defined in vmcore.c. This is
the "adding remap code in vmcore" idea above. The existing
copy_old_memory() path is left in place for reading the ELF headers.

Unfortunately, this patch is still really buggy. The dump was
generated correctly on my small 1GB kvm guest machine, but a
scheduler bug occurred on the 32GB bare-metal machine, the same one
I used for profiling your patch set.

The second idea, passing a memmap= kernel option, works as follows.

kexec passes specific memory map information to the 2nd kernel using
the memmap= kernel parameter; maybe only '#' is currently used by
kexec. Different delimiters have different meanings: '@' means System
RAM, '#' means ACPI Table, and '$' means "don't use this memory".

memmap=nn[KMG]@ss[KMG]
        [KNL] Force usage of a specific region of memory
        Region of memory to be used, from ss to ss+nn.

memmap=nn[KMG]#ss[KMG]
        [KNL,ACPI] Mark specific memory as ACPI data.
        Region of memory to be used, from ss to ss+nn.

memmap=nn[KMG]$ss[KMG]
        [KNL,ACPI] Mark specific memory as reserved.
        Region of memory to be used, from ss to ss+nn.
        Example: Exclude memory from 0x18690000-0x1869ffff

Like this, why not introduce another memmap= variant to tell the 2nd
kernel to create the linear mapping address space at boot? Or maybe it
is sufficient for this issue to use one of the above three kinds of
memmap=.
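Concretely, the existing syntax would let the 1st kernel's kexec append
entries like the following to the 2nd kernel's command line. The values
and the choice of delimiter here are purely illustrative:

```shell
# Existing forms, per Documentation/kernel-parameters.txt:
#   memmap=nn[KMG]@ss[KMG]   force usage as System RAM
#   memmap=nn[KMG]#ss[KMG]   mark as ACPI data
#   memmap=nn[KMG]$ss[KMG]   mark as reserved ("don't use")
#
# Hypothetical: one entry per old-memory region to be linear-mapped
# ('$' chosen arbitrarily; it usually needs escaping in bootloader
# config files)
memmap=944M$0x40000000 memmap=30G$0x100000000
```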