On 8/22/19 10:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most important metrics for memory
>> pressure.
>
> I would disagree with this statement. It is a rough estimate that tells
> how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount still might result in
> direct reclaim induced stalls. I do realize that this is a simple metric
> that is attractive to use and works in many cases though.
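
For context, the estimate is a heuristic roughly along these lines (a
simplified sketch of si_mem_available() in mm/page_alloc.c from that
era, with the per-zone watermark walk reduced to a stand-in helper;
not the exact code):

/*
 * MemAvailable: free pages above the kernel's reserves, plus the
 * portion of page cache and reclaimable kernel memory assumed to be
 * cheaply reclaimable.  An estimate, not a promise.
 */
static unsigned long mem_available_sketch(void)
{
        /* stand-in for summing the zones' low watermarks */
        unsigned long wmark_low = low_watermark_pages();
        unsigned long available, pagecache, reclaimable;

        available = global_zone_page_state(NR_FREE_PAGES) -
                    totalreserve_pages;

        /* assume only half of the page cache is cheaply reclaimable */
        pagecache = global_node_page_state(NR_ACTIVE_FILE) +
                    global_node_page_state(NR_INACTIVE_FILE);
        pagecache -= min(pagecache / 2, wmark_low);
        available += pagecache;

        /*
         * Ditto for reclaimable slab and KReclaimable -- the latter is
         * NR_KERNEL_MISC_RECLAIMABLE, which this patch makes cover
         * deferred split THPs as well.
         */
        reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) +
                      global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
        available += reclaimable - min(reclaimable / 2, wmark_low);

        return available;
}
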
>> Currently, the deferred split THPs are not accounted into available
>> memory, but they are actually reclaimable, like reclaimable slabs.
>>
>> And they seem very common with common workloads when THP is enabled.
>> A simple run of the MariaDB test from mmtests with THP set to
>> "always" shows it could generate over fifteen thousand deferred
>> split THPs (accumulating around 30G in a one-hour run, 75% of the
>> 40G memory of my VM). It looks worth accounting in MemAvailable.
>
> OK, this makes sense. But your above numbers are really worrying.
> Accumulating such a large amount of pages that are likely not going
> to be used is really bad. They are essentially blocking any higher
> order allocations and also push the system towards more memory
> pressure.
>
> IIUC deferred splitting is mostly a workaround for nasty locking
> issues during splitting, right? This is not really an optimization to
> cache THPs for reuse or something like that. What is the reason this
> is not done from a worker context? At least THPs which would be freed
> completely sound like a good candidate for kworker tear down, no?

Agreed that it's a good question. For Kirill :) Maybe with the kworker
approach we also wouldn't need the cgroup awareness?
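
For what it's worth, a kworker tear-down could look roughly like the
sketch below. This is purely hypothetical, not part of the patch: the
split_work field and the function name are invented, and the page
pinning and memcg details of the real deferred_split_scan() are
omitted:

static void thp_deferred_split_workfn(struct work_struct *work)
{
        struct deferred_split *ds_queue =
                container_of(work, struct deferred_split, split_work);
        struct page *page, *next;
        LIST_HEAD(list);
        unsigned long flags;

        /* Drain the queue under the lock, then split outside it. */
        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        list_splice_init(&ds_queue->split_queue, &list);
        ds_queue->split_queue_len = 0;
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

        list_for_each_entry_safe(page, next, &list, deferred_list) {
                page = compound_head(page);
                if (!trylock_page(page))
                        continue;
                /* frees the already unmapped tail pages on success */
                split_huge_page(page);
                unlock_page(page);
        }
}

deferred_split_huge_page() would then just schedule_work() on the
queue's work item after queueing the page, instead of waiting for the
shrinker to run under memory pressure.
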
>> Record the number of freeable normal pages of deferred split THPs
>> into the second tail page, and account it into KReclaimable.

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

>> Although THP allocations are not exactly "kernel allocations", once
>> they are unmapped, they are in fact kernel-only. KReclaimable has
>> been accounted into MemAvailable.

This sounds reasonable to me.

>> When the deferred split THPs get split due to memory pressure or
>> freed, just decrease the counter by the recorded number.
>>
>> With this change, when running a program which populates a 1G address
>> space and then calls madvise(MADV_DONTNEED) on 511 pages of every
>> THP, /proc/meminfo shows the deferred split THPs accounted properly.
>>
>> Populated, before calling madvise(MADV_DONTNEED):
>> MemAvailable:   43531960 kB
>> AnonPages:       1096660 kB
>> KReclaimable:      26156 kB
>> AnonHugePages:   1056768 kB
>>
>> After calling madvise(MADV_DONTNEED):
>> MemAvailable:   44411164 kB
>> AnonPages:         50140 kB
>> KReclaimable:    1070640 kB
>> AnonHugePages:     10240 kB
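
For reference, the test is essentially the following (a minimal sketch
reconstructed from the description above, not the exact program; it
assumes x86_64 with 2M THPs and THP enabled):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define SIZE    (1024UL * 1024 * 1024)  /* 1G */
#define HPAGE   (2UL * 1024 * 1024)     /* THP size on x86_64 */
#define PAGE    4096UL

int main(void)
{
        char *p = aligned_alloc(HPAGE, SIZE);

        /* fault the range in, hopefully as THPs */
        madvise(p, SIZE, MADV_HUGEPAGE);
        memset(p, 1, SIZE);

        /*
         * Zap 511 of the 512 base pages of every THP; each THP stays
         * mapped by a single page, so it is queued for deferred split
         * rather than freed.
         */
        for (unsigned long off = 0; off < SIZE; off += HPAGE)
                madvise(p + off + PAGE, HPAGE - PAGE, MADV_DONTNEED);

        pause();        /* compare /proc/meminfo now */
        return 0;
}
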
>> Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>> Cc: David Rientjes <rientjes@xxxxxxxxxx>
>> Signed-off-by: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
...
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
>>          INIT_LIST_HEAD(page_deferred_list(page));
>>          set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>> +        page[2].nr_freeable = 0;
>>  }
>>
>>  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
>> @@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>                  if (!list_empty(page_deferred_list(head))) {
>>                          ds_queue->split_queue_len--;
>>                          list_del(page_deferred_list(head));
>> +                        __mod_node_page_state(page_pgdat(page),
>> +                                        NR_KERNEL_MISC_RECLAIMABLE,
>> +                                        -head[2].nr_freeable);
>> +                        head[2].nr_freeable = 0;
>>                  }
>>                  if (mapping)
>>                          __dec_node_page_state(page, NR_SHMEM_THPS);
>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>                  ds_queue->split_queue_len--;
>>                  list_del(page_deferred_list(page));
>>          }
>> +        __mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +                        -page[2].nr_freeable);
>> +        page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the
page to the deferred list? So here the code would be in the
if (!list_empty()) { } part above.

>>          spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>          free_compound_page(page);
>>  }
>>
>> -void deferred_split_huge_page(struct page *page)
>> +void deferred_split_huge_page(struct page *page, unsigned int nr)
>>  {
>>          struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>  #ifdef CONFIG_MEMCG
>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>                  return;
>>
>>          spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +        page[2].nr_freeable += nr;
>> +        __mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +                        nr);

Same here, only do this when adding to the list, below? Or we might
perhaps account base pages multiple times?

>>          if (list_empty(page_deferred_list(page))) {
>>                  count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>                  list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
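
So the suggested shape would be roughly the following (a sketch of the
alternative being proposed, not the patch as posted; the trade-off is
that later partial unmaps of an already-queued THP would no longer
bump the counter):

void deferred_split_huge_page(struct page *page, unsigned int nr)
{
        struct deferred_split *ds_queue = get_deferred_split_queue(page);
        unsigned long flags;

        /* memcg and PageSwapCache checks omitted for brevity */
        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (list_empty(page_deferred_list(page))) {
                /* account freeable pages only when actually queueing */
                page[2].nr_freeable += nr;
                __mod_node_page_state(page_pgdat(page),
                                      NR_KERNEL_MISC_RECLAIMABLE, nr);
                count_vm_event(THP_DEFERRED_SPLIT_PAGE);
                list_add_tail(page_deferred_list(page),
                              &ds_queue->split_queue);
                ds_queue->split_queue_len++;
        }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}

...and free_transhuge_page() would then undo the accounting only
inside its if (!list_empty()) branch, as noted above.
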
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e5dfe2a..6008fab 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
>>          if (nr) {
>>                  __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
>> -                deferred_split_huge_page(page);
>> +                deferred_split_huge_page(page, nr);
>>          }
>>  }
>> @@ -1320,7 +1320,7 @@ void page_remove_rmap(struct page *page, bool compound)
>>                  clear_page_mlock(page);
>>
>>          if (PageTransCompound(page))
>> -                deferred_split_huge_page(compound_head(page));
>> +                deferred_split_huge_page(compound_head(page), 1);
>>
>>          /*
>>           * It would be tidy to reset the PageAnon mapping here,
>> --
>> 1.8.3.1