Re: [PATCH 1/2] mm: disable LRU pagevec during the migration temporarily

From: David Hildenbrand
Date: Thu Mar 04 2021 - 03:10:20 EST


On 03.03.21 21:23, Minchan Kim wrote:
On Wed, Mar 03, 2021 at 01:49:36PM +0100, Michal Hocko wrote:
On Tue 02-03-21 13:09:48, Minchan Kim wrote:
The LRU pagevec holds a reference on each page until the pagevec is
drained. That can prevent migration because the page's refcount is then
higher than the migration logic expects. To mitigate the issue,
callers of migrate_pages drain the LRU pagevec via migrate_prep or
lru_add_drain_all before calling migrate_pages.
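
The refcount mismatch described above can be illustrated with a small
user-space toy model. This is only a sketch: every toy_* name and the
batch size are hypothetical and are not kernel APIs, and the real
expected-count check in the migration code is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; none of these names are kernel APIs. */
#define TOY_BATCH_SIZE 15	/* assumed batch size, for illustration only */

struct toy_page {
	int refcount;
};

struct toy_pagevec {
	size_t nr;
	struct toy_page *pages[TOY_BATCH_SIZE];
};

/* Queue a page in the batch; the batch pins it with an extra reference. */
static bool toy_pagevec_add(struct toy_pagevec *pvec, struct toy_page *page)
{
	if (pvec->nr == TOY_BATCH_SIZE)
		return false;	/* batch full, caller must drain first */
	page->refcount++;
	pvec->pages[pvec->nr++] = page;
	return true;
}

/* Drain the batch: drop the reference held for every queued page. */
static void toy_pagevec_drain(struct toy_pagevec *pvec)
{
	for (size_t i = 0; i < pvec->nr; i++) {
		assert(pvec->pages[i]->refcount > 0);
		pvec->pages[i]->refcount--;
	}
	pvec->nr = 0;
}

/* Migration proceeds only if nobody holds an unexpected reference. */
static bool toy_can_migrate(const struct toy_page *page, int expected_refs)
{
	return page->refcount == expected_refs;
}
```

In this model a page with one expected reference becomes unmigratable
as soon as toy_pagevec_add() queues it, and becomes migratable again
only after toy_pagevec_drain().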

However, that is not enough: pages that enter the pagevec after the
drain call can still sit there and keep preventing page migration.
Since some callers of migrate_pages have retry logic with LRU
draining, the page would migrate on the next retry, but this is still
fragile in that it doesn't close the fundamental race between pages
entering the LRU pagevec and migration, so a migration failure can
ultimately cause a contiguous memory allocation failure.

To close the race, this patch disables the LRU caches (i.e., pagevec)
while migration is ongoing, until the migration is done.
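
One way to picture the idea is a user-space toy in which the cache
flushes immediately while it is disabled, so no page keeps an extra
reference during migration. This is a sketch under assumptions, not
the patch's implementation: all sketch_* names, the nesting counter
and the batch size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model; none of these names are kernel APIs. */
#define SKETCH_BATCH_SIZE 15	/* assumed batch size, for illustration only */

struct sketch_page {
	int refcount;
};

struct sketch_lru_cache {
	int disable_depth;	/* > 0 while a migration is in flight */
	size_t nr;
	struct sketch_page *pages[SKETCH_BATCH_SIZE];
};

/* Drop the reference held for every queued page and empty the batch. */
static void sketch_cache_drain(struct sketch_lru_cache *c)
{
	for (size_t i = 0; i < c->nr; i++)
		c->pages[i]->refcount--;
	c->nr = 0;
}

/* Queue a page; flush at once if the batch is full or caching is disabled. */
static void sketch_cache_add(struct sketch_lru_cache *c, struct sketch_page *page)
{
	page->refcount++;
	c->pages[c->nr++] = page;
	if (c->disable_depth > 0 || c->nr == SKETCH_BATCH_SIZE)
		sketch_cache_drain(c);
}

/* Called before migration: stop batching and flush what is already queued. */
static void sketch_cache_disable(struct sketch_lru_cache *c)
{
	c->disable_depth++;
	sketch_cache_drain(c);
}

/* Called after migration: allow batching again. */
static void sketch_cache_enable(struct sketch_lru_cache *c)
{
	assert(c->disable_depth > 0);
	c->disable_depth--;
}
```

Between sketch_cache_disable() and sketch_cache_enable(), a page added
to the cache ends up with its original refcount, which is the window
the draining-only approach leaves open.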

Since it's really hard to reproduce, I measured how many times
migrate_pages retried in force mode, using the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..

    if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
        printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
        dump_page(page, "fail to migrate");
    }

The test repeatedly launched Android apps while a CMA allocation ran
in the background every five seconds. In total, about 500 CMA
allocations were made during the test. With this patch, the dump_page
count dropped from 400 to 30.

Have you seen any improvement on the CMA allocation success rate?

Unfortunately, the CMA allocation failure rate is really hard to
reproduce with a reasonable margin of error under a real workload.
That's why I measured this soft metric instead of direct CMA failures
under a real workload (I don't want to write some ad hoc artificial
benchmark and keep tuning system knobs until it shows an
extremely exaggerated result just to demonstrate the patch's effect).

Please say if you believe this work is pointless unless there is
stable data from a reproducible scenario. I am happy to drop it.

Do you have *some* application that triggers such a high retry count?

I'd love to run it along with virtio-mem and report the actual allocation success rate / necessary retries. That could give an indication of how helpful your work would be.

Anything that improves the reliability of alloc_contig_range() is of high interest to me. If it doesn't increase the reliability but merely makes some internal improvements (fewer retries), it might still be valuable, but not that important.

--
Thanks,

David / dhildenb