Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region

From: HATAYAMA Daisuke
Date: Fri Jan 18 2013 - 09:07:12 EST


From: Vivek Goyal <vgoyal@xxxxxxxxxx>
Subject: Re: [RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region
Date: Thu, 17 Jan 2013 17:13:48 -0500

> On Thu, Jan 10, 2013 at 08:59:34PM +0900, HATAYAMA Daisuke wrote:
>> Currently, kdump reads the 1st kernel's memory, called old memory in
>> the source code, by calling ioremap for one page at a time. This
>> causes a big performance degradation, since a page table modification
>> and a TLB flush happen each time a single page is read.
>>
>> This issue came to light through Cliff's kernel-space filtering work.
>>
>> To avoid calling ioremap, we map the whole of the 1st kernel's memory
>> targeted as vmcore regions in the direct mapping table. This gives a
>> big performance improvement. See the following simple benchmark.
>>
>> Machine spec:
>>
>> | CPU    | Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz (4 sockets, 8 cores) (*) |
>> | Memory | 32 GB                                                            |
>> | Kernel | 3.7 vanilla and with this patch set                              |
>>
>> (*) only 1 CPU is used in the 2nd kernel now.
>>
>> Benchmark:
>>
>> I executed the following commands on the 2nd kernel and recorded real
>> time.
>>
>> $ time dd bs=$((4096 * n)) if=/proc/vmcore of=/dev/null
>>
>> [3.7 vanilla]
>>
>> | block size | time      | performance |
>> | [KB]       |           | [MB/sec]    |
>> |------------+-----------+-------------|
>> |          4 | 5m 46.97s |       93.56 |
>> |          8 | 4m 20.68s |      124.52 |
>> |         16 | 3m 37.85s |      149.01 |
>>
>> [3.7 with this patch]
>>
>> | block size | time   | performance |
>> | [KB]       |        | [GB/sec]    |
>> |------------+--------+-------------|
>> |          4 | 17.59s |        1.85 |
>> |          8 | 14.73s |        2.20 |
>> |         16 | 14.26s |        2.28 |
>> |         32 | 13.38s |        2.43 |
>> |         64 | 12.77s |        2.54 |
>> |        128 | 12.41s |        2.62 |
>> |        256 | 12.50s |        2.60 |
>> |        512 | 12.37s |        2.62 |
>> |       1024 | 12.30s |        2.65 |
>> |       2048 | 12.29s |        2.64 |
>> |       4096 | 12.32s |        2.63 |
>>
>
> These are impressive improvements. I missed the discussion on mmap().
> So why couldn't we provide an mmap() interface for /proc/vmcore? If that
> works, then the application can choose to mmap/unmap bigger chunks of the
> file (instead of ioremap mapping/remapping a page at a time).
>
> And if the application controls the size of the mapping, then it can vary
> the size of the mapping based on the amount of free memory available. That
> way, if somebody reserves less memory, we could still dump, but with
> some time penalty.
>

mmap() needs a user-space page table in addition to the kernel-space
one, and it looks like remap_pfn_range(), which creates the user-space
page table, doesn't support large pages, only 4KB pages. If the
application mmaps only small chunks in order to keep memory consumption
low, then we would again face the same issue as with ioremap. I don't
know whether hugetlbfs supports mmap with 1GB pages now.
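
To make the trade-off concrete, here is a rough user-space sketch of the
"application controls the size of the mapping" idea. It is only an
illustration in Python: read_in_windows() and its parameters are
hypothetical names, it runs against an ordinary file, and it assumes
nothing about whether /proc/vmcore itself can be mmapped. The smaller
the window, the more map/unmap cycles are needed for the same amount of
data.

```python
import mmap
import os

def read_in_windows(path, window_bytes):
    """Walk a file through successive read-only mmap() windows.

    Returns (bytes_read, number_of_windows). Hypothetical helper, for
    illustration only.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    # mmap offsets must be multiples of the allocation granularity,
    # so round the window down to a multiple (at least one unit).
    window_bytes = max(gran, window_bytes - window_bytes % gran)
    size = os.path.getsize(path)
    total = 0
    windows = 0
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window_bytes, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                # A real dump tool would filter or compress here.
                total += len(m[:])
            windows += 1
            offset += length
    return total, windows
```

Each window costs one mapping setup and one teardown, so shrinking the
window to save memory raises the number of page table updates in
proportion.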

Another idea to reduce the size of the page tables is to extend the
mapping ranges so that the whole memory is covered with as many 1GB
pages as possible. For example, suppose M is the size of system memory;
then the total size of the PGD and PUD pages needed to cover M is:

  ( 1 + roundup(M, 512GB) / 512GB ) * PAGE_SIZE
    ~   ~~~~~~~~~~~~~~~~~~~~~~~~~
    ^               ^
    |               |
PGD page        PUD pages

Ideally, a 2TB system can be covered with only 20KB and a 16TB system
with only 132KB.
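
The arithmetic can be checked with a few lines of Python. This is just a
sketch of the formula for PGD and PUD pages, assuming x86_64 4-level
paging (4KB page table pages with 512 entries each, so one PUD page of
1GB entries spans 512GB) and that a single PGD page is enough. The
helper names are made up for this example; pte_bytes_4k() is added only
for comparison with a mapping built from 4KB pages.

```python
PAGE_SIZE = 4096
ENTRIES = 512              # entries per page table page (assumed x86_64)
GB = 1 << 30
TB = 1 << 40
PUD_SPAN = ENTRIES * GB    # one PUD page of 1GB entries covers 512GB

def roundup(x, align):
    """Round x up to the next multiple of align."""
    return -(-x // align) * align

def pgd_pud_bytes(mem_bytes):
    """(1 + roundup(M, 512GB) / 512GB) * PAGE_SIZE"""
    return (1 + roundup(mem_bytes, PUD_SPAN) // PUD_SPAN) * PAGE_SIZE

def pte_bytes_4k(mem_bytes):
    """Size of the PTE pages alone if the range is mapped with 4KB pages."""
    pte_span = ENTRIES * PAGE_SIZE   # one PTE page covers 2MB
    return roundup(mem_bytes, pte_span) // pte_span * PAGE_SIZE

for mem in (2 * TB, 16 * TB):
    print(mem // TB, "TB:", pgd_pud_bytes(mem) // 1024, "KB")
```

For comparison, under the same assumptions pte_bytes_4k(32 * GB) comes
to 64MB for the 32 GB test machine, before counting the upper-level
tables.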

So I first want to evaluate this logic. Although I haven't actually
checked yet, I expect that most memory maps on terabyte-class machines
consist of 1GB-aligned huge chunks.

Thanks.
HATAYAMA, Daisuke
