Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node

From: Zi Yan
Date: Wed Mar 27 2019 - 16:43:00 EST


On 27 Mar 2019, at 11:00, Dave Hansen wrote:

> On 3/27/19 10:48 AM, Zi Yan wrote:
>> For 40MB/s vs 750MB/s, they were using sys_migrate_pages(). Sorry
>> about the confusion there. As I measure only the migrate_pages() in
>> the kernel, the throughput becomes: migrating 4KB page: 0.312GB/s
>> vs migrating 512 4KB pages: 0.854GB/s. They are still >2x
>> difference.
>>
>> Furthermore, if we only consider the migrate_page_copy() in
>> mm/migrate.c, which only calls copy_highpage() and
>> migrate_page_states(), the throughput becomes: migrating 4KB page:
>> 1.385GB/s vs migrating 512 4KB pages: 1.983GB/s. The gap is
>> smaller, but migrating 512 4KB pages still achieves 40% more
>> throughput.
>>
>> Do these numbers make sense to you?
>
> Yes. It would be very interesting to batch the migrations in the
> kernel and see how it affects the code. A 50% boost is interesting,
> but not if it's only in microbenchmarks and takes 2k lines of code.
>
> 50% is *very* interesting if it happens in the real world and we can
> do it in 10 lines of code.
>
> So, let's see what the code looks like.

Actually, the migration throughput difference does not come from any kernel
changes; it is a pure comparison between migrate_pages(single 4KB page) and
migrate_pages(a list of 4KB pages). The point I wanted to make is that
Yang's approach, which migrates a list of pages at the end of shrink_page_list(),
can achieve higher throughput than Keith's approach, which migrates one page
at a time inside the while loop in shrink_page_list().
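
For reference, here is a minimal sketch of the two call patterns (my own
illustration, not code from either series; alloc_demote_page() is a made-up
helper and MR_NUMA_MISPLACED is just a placeholder migration reason):

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/migrate.h>
#include <linux/mm.h>

/* new_page_t callback: allocate the destination page on the PMEM node. */
static struct page *alloc_demote_page(struct page *page, unsigned long node)
{
	return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);
}

/*
 * Keith's pattern: one migrate_pages() call per page inside the
 * shrink_page_list() while loop.  The return value (pages left
 * unmigrated) is ignored in this sketch.
 */
static void demote_single(struct page *page, int pmem_nid)
{
	LIST_HEAD(single);

	list_add(&page->lru, &single);
	migrate_pages(&single, alloc_demote_page, NULL, pmem_nid,
		      MIGRATE_ASYNC, MR_NUMA_MISPLACED);
}

/*
 * Yang's pattern: accumulate cold pages on demote_pages while walking
 * the list, then issue one batched migrate_pages() call at the end.
 */
static void demote_batched(struct list_head *demote_pages, int pmem_nid)
{
	migrate_pages(demote_pages, alloc_demote_page, NULL, pmem_nid,
		      MIGRATE_ASYNC, MR_NUMA_MISPLACED);
}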

In addition to the above, migrating a single THP can get us even higher throughput.
Here are the throughput numbers comparing all three cases:

                           | migrate_pages() | migrate_page_copy()
migrating single 4KB page: |    0.312GB/s    |     1.385GB/s
migrating 512 4KB pages:   |    0.854GB/s    |     1.983GB/s
migrating single 2MB THP:  |    2.387GB/s    |     2.481GB/s

Obviously, we would like to migrate a THP as a whole instead of splitting it
into 512 4KB pages and migrating them individually. Of course, this assumes
PMEM has free space for THP-sized allocations and that all subpages in the THP
are cold.


To batch the migration, I posted some code a while ago: https://lwn.net/Articles/714991/,
which shows good throughput improvements when microbenchmarking sys_migrate_pages().
It also included using multiple threads to copy a page, aggregating multiple
migrate_page_copy() calls, and even using DMA instead of the CPU to copy data.
We could revisit that code if necessary.
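
To give a flavor of the multi-threaded copy idea, here is a rough sketch under
stated assumptions (a 2MB THP, a 64-bit direct map so page_address() covers the
whole huge page, and CPUs 0-3 online); it is not the code from the LWN posting,
and copy_chunk_work is a made-up name:

#include <linux/completion.h>
#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/workqueue.h>

#define NR_COPY_THREADS	4

struct copy_chunk_work {
	struct work_struct work;
	void *dst;
	void *src;
	size_t len;
	struct completion done;
};

static void copy_chunk_fn(struct work_struct *work)
{
	struct copy_chunk_work *c =
		container_of(work, struct copy_chunk_work, work);

	memcpy(c->dst, c->src, c->len);
	complete(&c->done);
}

/*
 * Split the 2MB copy into NR_COPY_THREADS chunks and run each chunk
 * on a different CPU via the high-priority system workqueue, then
 * wait for all chunks to finish.
 */
static void copy_huge_page_parallel(struct page *dst, struct page *src)
{
	size_t chunk = HPAGE_PMD_SIZE / NR_COPY_THREADS;
	struct copy_chunk_work w[NR_COPY_THREADS];
	int i;

	for (i = 0; i < NR_COPY_THREADS; i++) {
		w[i].dst = page_address(dst) + i * chunk;
		w[i].src = page_address(src) + i * chunk;
		w[i].len = chunk;
		init_completion(&w[i].done);
		INIT_WORK_ONSTACK(&w[i].work, copy_chunk_fn);
		queue_work_on(i, system_highpri_wq, &w[i].work);
	}
	for (i = 0; i < NR_COPY_THREADS; i++) {
		wait_for_completion(&w[i].done);
		destroy_work_on_stack(&w[i].work);
	}
}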

In terms of end-to-end results, I also have some results from my paper:
http://www.cs.yale.edu/homes/abhishek/ziyan-asplos19.pdf (Figures 8 to 11 show
the microbenchmark results and Figure 12 shows the end-to-end results). I basically
called shrink_active/inactive_list() every 5 seconds to track page hotness and used
all the page migration optimizations above, which yields a 40% application runtime
speedup on average. The experiments were done on a two-socket NUMA machine where one
node was slowed down to half the bandwidth and twice the access latency of the other
node. I can discuss it more if you are interested.
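
The skeleton of that 5-second scan is just a self-rearming delayed work item.
A minimal sketch, where scan_node_lrus() is a hypothetical stand-in for the
shrink_active_list()/shrink_inactive_list() calls (those are static to
mm/vmscan.c, so a real out-of-tree module would need an exported hook):

#include <linux/jiffies.h>
#include <linux/module.h>
#include <linux/workqueue.h>

#define SCAN_INTERVAL	(5 * HZ)	/* scan every 5 seconds */

static struct delayed_work scan_work;

/*
 * Hypothetical: walk the active/inactive LRU lists to age pages and
 * record which ones were referenced since the last scan.
 */
static void scan_node_lrus(void)
{
}

static void scan_fn(struct work_struct *work)
{
	scan_node_lrus();
	schedule_delayed_work(&scan_work, SCAN_INTERVAL);
}

static int __init hotness_scan_init(void)
{
	INIT_DELAYED_WORK(&scan_work, scan_fn);
	schedule_delayed_work(&scan_work, SCAN_INTERVAL);
	return 0;
}

static void __exit hotness_scan_exit(void)
{
	cancel_delayed_work_sync(&scan_work);
}

module_init(hotness_scan_init);
module_exit(hotness_scan_exit);
MODULE_LICENSE("GPL");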


--
Best Regards,
Yan Zi
