Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

From: Vlastimil Babka
Date: Thu Aug 22 2019 - 09:46:10 EST


On 8/22/19 10:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most important metrics for memory
>> pressure.
>
> I would disagree with this statement. It is a rough estimate that tells
> how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount still might result in
> direct reclaim induced stalls. I do realize that this is a simple
> metric that is attractive to use and works in many cases, though.
>
>> Currently, the deferred split THPs are not accounted into available
>> memory, but they are actually reclaimable, like reclaimable slabs.
>>
>> And they seem very common with typical workloads when THP is enabled.
>> A simple run of the MariaDB test from mmtests with THP set to "always"
>> shows it can generate over fifteen thousand deferred split THPs
>> (accumulating around 30G over a one-hour run, 75% of my VM's 40G of
>> memory). It looks worth accounting for in MemAvailable.
>
> OK, this makes sense. But your above numbers are really worrying.
> Accumulating such a large number of pages that are likely never going
> to be used is really bad. They essentially block any higher-order
> allocations and also push the system towards more memory pressure.
>
> IIUC deferred splitting is mostly a workaround for nasty locking issues
> during splitting, right? This is not really an optimization to cache
> THPs for reuse or something like that. What is the reason this is not
> done from a worker context? At least THPs which would be freed
> completely sound like good candidates for kworker teardown, no?

Agreed, that's a good question. One for Kirill :) Maybe with a kworker
approach we also wouldn't need the cgroup awareness?
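
To illustrate very roughly what I mean, here's an untested sketch of a
kworker draining the queue. The work member on struct deferred_split is
invented here, and split_queue_len bookkeeping, memcg awareness and scan
limits are all omitted; it just borrows the pin/trylock/split pattern
the shrinker already uses:

/* Rough sketch of the kworker idea, not from the patch. */
static void deferred_split_workfn(struct work_struct *work)
{
        struct deferred_split *ds_queue =
                container_of(work, struct deferred_split, work);
        struct page *page, *next;
        unsigned long flags;
        LIST_HEAD(list);

        /* Pin the queued THPs so they cannot be freed under us. */
        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        list_for_each_entry_safe(page, next, &ds_queue->split_queue,
                                 deferred_list) {
                struct page *head = compound_head(page);

                if (get_page_unless_zero(head))
                        list_move(page_deferred_list(head), &list);
                else
                        /* Lost a race with the THP being freed. */
                        list_del_init(page_deferred_list(head));
        }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

        list_for_each_entry_safe(page, next, &list, deferred_list) {
                struct page *head = compound_head(page);

                if (trylock_page(head)) {
                        /* Unlinks the page from the list on success. */
                        split_huge_page(head);
                        unlock_page(head);
                }
                put_page(head);
        }

        /* Requeue whatever could not be split. */
        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        list_splice_tail(&list, &ds_queue->split_queue);
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}

Scheduling that from deferred_split_huge_page() once the queue grows
past some threshold would also naturally bound how much accumulates.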

>> Record the number of freeable normal pages of deferred split THPs in
>> the second tail page, and account it in KReclaimable. Although THP
>> allocations are not exactly "kernel allocations", once they are
>> unmapped they are in effect kernel-only. KReclaimable is already
>> accounted in MemAvailable.
>
> This sounds reasonable to me.
>
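
(For context, I assume the struct page side ends up looking something
like the below in the second-tail-page part of mm_types.h; I'm inferring
this from the page[2].nr_freeable uses in the diff, so the exact
placement and type are my guess:)

        struct {        /* Second tail page of compound page */
                unsigned long _compound_pad_1;  /* compound_head */
                unsigned long _compound_pad_2;
                struct list_head deferred_list;
                /* Freeable base pages of a deferred split THP. */
                unsigned long nr_freeable;
        };
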
>> When the deferred split THPs get split under memory pressure or freed,
>> just decrease the counter by the recorded number.
>>
>> With this change, when running a program which populates a 1G address
>> space and then calls madvise(MADV_DONTNEED) on 511 pages of every THP,
>> /proc/meminfo shows the deferred split THPs accounted properly.
>>
>> Before calling madvise(MADV_DONTNEED):
>> MemAvailable: 43531960 kB
>> AnonPages: 1096660 kB
>> KReclaimable: 26156 kB
>> AnonHugePages: 1056768 kB
>>
>> After calling madvise(MADV_DONTNEED):
>> MemAvailable: 44411164 kB
>> AnonPages: 50140 kB
>> KReclaimable: 1070640 kB
>> AnonHugePages: 10240 kB
>>
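
FWIW, the described test should be easy to reproduce with something like
the below (my own reconstruction, not the program from the changelog):

/* Populate 1G, ideally as THPs, then MADV_DONTNEED all but the first
 * base page of each 2M THP, putting every THP on the deferred split
 * queue. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE    (1UL << 30)     /* 1G */
#define THPSZ   (2UL << 20)     /* 2M */
#define PAGESZ  4096UL

int main(void)
{
        unsigned long off;
        char *p;

        if (posix_memalign((void **)&p, THPSZ, SIZE))
                return 1;
        madvise(p, SIZE, MADV_HUGEPAGE);
        memset(p, 1, SIZE);     /* fault in, hopefully as THPs */

        puts("populated; check /proc/meminfo, then press enter");
        getchar();

        for (off = 0; off < SIZE; off += THPSZ)
                /* keep the first base page, drop the other 511 */
                madvise(p + off + PAGESZ, THPSZ - PAGESZ, MADV_DONTNEED);

        puts("madvised; check /proc/meminfo, then press enter");
        getchar();
        return 0;
}
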
>> Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>> Cc: David Rientjes <rientjes@xxxxxxxxxx>
>> Signed-off-by: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

...

>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
>>
>>          INIT_LIST_HEAD(page_deferred_list(page));
>>          set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>> +        page[2].nr_freeable = 0;
>>  }
>>
>>  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
>> @@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>                  if (!list_empty(page_deferred_list(head))) {
>>                          ds_queue->split_queue_len--;
>>                          list_del(page_deferred_list(head));
>> +                        __mod_node_page_state(page_pgdat(page),
>> +                                              NR_KERNEL_MISC_RECLAIMABLE,
>> +                                              -head[2].nr_freeable);
>> +                        head[2].nr_freeable = 0;
>>                  }
>>                  if (mapping)
>>                          __dec_node_page_state(page, NR_SHMEM_THPS);
>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>                  ds_queue->split_queue_len--;
>>                  list_del(page_deferred_list(page));
>>          }
>> +        __mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +                              -page[2].nr_freeable);
>> +        page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the page
to the deferred list? Here that would mean putting this code in the
if (!list_empty()) { } part above.
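
I.e. something like this (untested):

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (!list_empty(page_deferred_list(page))) {
                ds_queue->split_queue_len--;
                list_del(page_deferred_list(page));
                /* Only pages on the deferred list were ever accounted. */
                __mod_node_page_state(page_pgdat(page),
                                      NR_KERNEL_MISC_RECLAIMABLE,
                                      -page[2].nr_freeable);
                page[2].nr_freeable = 0;
        }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);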

>>          spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>          free_compound_page(page);
>>  }
>>
>> -void deferred_split_huge_page(struct page *page)
>> +void deferred_split_huge_page(struct page *page, unsigned int nr)
>>  {
>>          struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>  #ifdef CONFIG_MEMCG
>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>                  return;
>>
>>          spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +        page[2].nr_freeable += nr;
>> +        __mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +                              nr);

Same here: only do this when actually adding to the list, below?
Otherwise we might account base pages multiple times?
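
I.e. account only at the point where the page actually joins the queue,
something like (untested, and it would then only count the nr of the
first call):

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (list_empty(page_deferred_list(page))) {
                count_vm_event(THP_DEFERRED_SPLIT_PAGE);
                list_add_tail(page_deferred_list(page),
                              &ds_queue->split_queue);
                page[2].nr_freeable += nr;
                __mod_node_page_state(page_pgdat(page),
                                      NR_KERNEL_MISC_RECLAIMABLE, nr);
                ...
        }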

>>          if (list_empty(page_deferred_list(page))) {
>>                  count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>                  list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e5dfe2a..6008fab 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
>>
>>          if (nr) {
>>                  __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
>> -                deferred_split_huge_page(page);
>> +                deferred_split_huge_page(page, nr);
>>          }
>>  }
>>
>> @@ -1320,7 +1320,7 @@ void page_remove_rmap(struct page *page, bool compound)
>>                  clear_page_mlock(page);
>>
>>          if (PageTransCompound(page))
>> -                deferred_split_huge_page(compound_head(page));
>> +                deferred_split_huge_page(compound_head(page), 1);
>>
>>          /*
>>           * It would be tidy to reset the PageAnon mapping here,
>> --
>> 1.8.3.1
>