Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom

From: Michal Hocko
Date: Tue Feb 16 2021 - 03:14:08 EST


On Tue 16-02-21 03:07:13, Eiichi Tsukata wrote:
> Hugepages can be preallocated to avoid unpredictable allocation latency.
> If we run into a 4k page shortage, the kernel can trigger OOM even though
> there are free hugepages. When OOM is triggered by the user address page
> fault handler, we can use the oom notifier to free hugepages from user
> space, but if it is triggered by a memory allocation for the kernel,
> there is no way to handle it synchronously in user space.

Can you expand some more on what kind of problem you are seeing?
Hugetlb pages are, by definition, a preallocated, unreclaimable and
admin-controlled pool of pages. Under those conditions it is expected
and required that the sizing be done very carefully. Why is that a
problem in your particular setup/scenario?

If the sizing is really done properly, and a random process can then
trigger the OOM killer and shrink the pool, this can lead to
malfunctioning of those workloads which do depend on the hugetlb pool,
right? So isn't this a kind of DoS scenario?

> This patch introduces a new sysctl, vm.sacrifice_hugepage_on_oom. If
> enabled, it first tries to free a hugepage, if one is available, before
> invoking the oom-killer. The default value is disabled so as not to
> change the current behavior.

Why is this interface not hugepage size aware? Releasing a 1GB huge page
is quite different from releasing a 2MB one. Or is it expected to
release the smallest one? On to the implementation...

[...]
> +static int sacrifice_hugepage(void)
> +{
> +	int ret;
> +
> +	spin_lock(&hugetlb_lock);
> +	ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);

... no, it is going to release a huge page from the default hstate. That
will be 2MB in most cases, but it is not guaranteed.
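
If size awareness is wanted, the natural shape would be to walk the
hstates rather than hardcode default_hstate. A rough, untested sketch
(the function name is mine, and the caller would need to hold
hugetlb_lock so that free_huge_pages stays stable):

static struct hstate *smallest_free_hstate(void)
{
	struct hstate *h, *smallest = NULL;

	for_each_hstate(h) {
		if (!h->free_huge_pages)
			continue;
		if (!smallest ||
		    huge_page_size(h) < huge_page_size(smallest))
			smallest = h;
	}
	return smallest;
}

so that a 2MB page would be sacrificed before a 1GB one ever is.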

Unless I am mistaken, this will also free up reserved hugetlb pages.
That would mean a page fault on a reserved mapping could SIGBUS, which
is very likely not something we want, right? You also want to use the
oom nodemask rather than a full one.
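
Something along these lines would be closer to what I would expect. A
sketch only, assuming the struct oom_control is plumbed through to this
point; the reservation check and locking details would need more care:

static int sacrifice_hugepage(struct oom_control *oc)
{
	struct hstate *h = &default_hstate;
	int ret = 0;

	spin_lock(&hugetlb_lock);
	/* do not give away pages backing existing reservations,
	 * otherwise a later fault on a reserved mapping SIGBUSes */
	if (h->free_huge_pages > h->resv_huge_pages)
		ret = free_pool_huge_page(h,
				oc->nodemask ? oc->nodemask :
					       &node_states[N_MEMORY],
				0);
	spin_unlock(&hugetlb_lock);
	return ret;
}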

Overall, I am not really happy about this feature even with the above
fixed, but let's hear more about the actual problem first.
--
Michal Hocko
SUSE Labs