Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9

From: Florian Fainelli
Date: Thu Apr 22 2021 - 15:32:00 EST




On 4/22/2021 11:35 AM, David Hildenbrand wrote:
> On 22.04.21 19:50, Florian Fainelli wrote:
>>
>>
>> On 4/22/2021 1:56 AM, David Hildenbrand wrote:
>>> On 22.04.21 09:49, Michal Hocko wrote:
>>>> Cc David and Oscar who are familiar with this code as well.
>>>>
>>>> On Wed 21-04-21 11:36:01, Florian Fainelli wrote:
>>>>> Hi all,
>>>>>
>>>>> I have been trying for the past few days to identify the source of a
>>>>> performance regression that we are seeing with the 5.4 kernel but not
>>>>> with the 4.9 kernel on ARM64. Testing something newer like 5.10 is a bit
>>>>> challenging at the moment but will happen eventually.
>>>>>
>>>>> What we are seeing is a ~3x increase in the time needed for
>>>>> alloc_contig_range() to allocate 1GB in blocks of 2MB pages. The system
>>>>> is idle at the time and there are no other contenders for memory other
>>>>> than the user-space programs already started (DHCP client, shell, etc.).
>>>
>>> Hi,
>>>
>>> If you can easily reproduce it, it might be worth just trying to bisect;
>>> that could be faster than manually poking around in the code.
>>>
>>> Also, it would be worth having a look at the state of upstream Linux.
>>> Upstream Linux developers tend to not care about minor performance
>>> regressions on oldish kernels.
>>
>> This is a big pain point here and I cannot agree more, but until we
>> bridge that gap, this is not exactly easy to do for me unfortunately and
>> neither is bisection :/
>>
>>>
>>> There has been work on improving exactly the situation you are
>>> describing -- a "fail fast" / "no retry" mode for alloc_contig_range().
>>> Maybe it tackles exactly this issue.
>>>
>>> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@xxxxxxxxxx
>>>
>>> Minchan is already on cc.
>>
>> This patch does not appear to be helping, in fact, I had locally applied
>> this patch from way back when:
>>
>> https://lkml.org/lkml/2014/5/28/113
>>
>> which would effectively do this unconditionally. Let me see if I can
>> showcase this problem on an x86 virtual machine operating in similar
>> conditions to ours.
>
> How exactly are you allocating these 2MiB blocks?
>
> Via CMA->alloc_contig_range() or via alloc_contig_range() directly? I
> assume via CMA.

I am allocating this memory directly via alloc_contig_range(start, end,
MIGRATE_MOVABLE, GFP_KERNEL), looping over 1024MB in 2MB increments.
This is a synthetic benchmark, though we do have an allocator that
behaves the same way.
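
For reference, the shape of that loop can be sketched in user-space C.
This is only a model under stated assumptions: 4 KiB pages are assumed,
stub_alloc_contig_range() is a hypothetical stand-in that only counts
calls (the real alloc_contig_range() also takes the migratetype and gfp
mask and exists only inside the kernel), and alloc_loop() and
MB_TO_PAGES() are names made up for illustration.

```c
#include <assert.h>

#define PAGE_SHIFT 12UL /* assumption: 4 KiB pages */
#define MB_TO_PAGES(mb) ((((unsigned long)(mb)) << 20) >> PAGE_SHIFT)

/* Number of times the stub below has been called. */
static unsigned long stub_calls;

/*
 * Hypothetical stand-in for the kernel's alloc_contig_range().
 * It performs no allocation; it only checks the PFN range and counts.
 */
static int stub_alloc_contig_range(unsigned long start_pfn,
                                   unsigned long end_pfn)
{
    assert(end_pfn > start_pfn);
    stub_calls++;
    return 0; /* 0 means success, as in the kernel function */
}

/*
 * Walk total_mb of PFN space starting at base_pfn in chunk_mb steps,
 * requesting one contiguous range per step.
 * Returns the number of chunks for which the request succeeded.
 */
static unsigned long alloc_loop(unsigned long base_pfn,
                                unsigned long total_mb,
                                unsigned long chunk_mb)
{
    unsigned long chunk_pages = MB_TO_PAGES(chunk_mb);
    unsigned long end_pfn = base_pfn + MB_TO_PAGES(total_mb);
    unsigned long pfn, ok = 0;

    for (pfn = base_pfn; pfn + chunk_pages <= end_pfn; pfn += chunk_pages) {
        if (stub_alloc_contig_range(pfn, pfn + chunk_pages) == 0)
            ok++;
    }
    return ok;
}
```

With these assumptions, alloc_loop(base, 1024, 2) issues 512 requests of
512 pages each; in the real benchmark each of those requests is what
gets timed.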

>
> For
>
> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@xxxxxxxxxx
>
> to do its work you'll have to pass __GFP_NORETRY to
> alloc_contig_range(). This requires CMA adaptions, from where we call
> alloc_contig_range().

Yes, I did modify the alloc_contig_range() caller to pass GFP_KERNEL |
__GFP_NORETRY. I ran more iterations (1000) and the results are not
very conclusive: with __GFP_NORETRY the time per allocation was not
significantly better; in fact it was slightly worse, by 100us, than
without.

My x86 VM with 1GB of DRAM, of which 512MB is in ZONE_MOVABLE, shows
identical numbers for both 4.9 and 5.4, so this must be something
specific to ARM64 and/or the code we added to create a ZONE_MOVABLE on
that architecture, since movablecore does not appear to have any effect
there, unlike on x86.
--
Florian