Re: [PATCH V2 2/2] arm64/mm: Enable memory hot remove

From: Anshuman Khandual
Date: Wed Apr 24 2019 - 01:59:38 EST


On 04/23/2019 09:35 PM, Mark Rutland wrote:
> On Tue, Apr 23, 2019 at 01:01:58PM +0530, Anshuman Khandual wrote:
>> Generic usage for init_mm.pagetable_lock
>>
>> Unless I have missed something, these are the generic init_mm kernel page table
>> modifiers at runtime (at least the ones which use init_mm.page_table_lock)
>>
>> 1. ioremap_page_range() /* Mapped I/O memory area */
>> 2. apply_to_page_range() /* Change existing kernel linear map */
>> 3. vmap_page_range() /* Vmalloc area */
>
> Internally, those all use the __p??_alloc() functions to handle racy
> additions by transiently taking the PTL when installing a new table, but
> otherwise walk kernel tables _without_ the PTL held. Note that none of
> these ever free an intermediate level of table.

Right, they don't free intermediate level page tables, but I was only curious
about the leaf level modifications.
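
To make sure we are talking about the same pattern, here is a hedged sketch of
how I read those paths (sketch_map_one_page() is a made-up helper, not actual
mm/ code): the intermediate table allocators only take init_mm.page_table_lock
transiently inside __p??_alloc(), while the leaf PTE write itself happens with
no lock held.

#include <linux/mm.h>
#include <linux/pfn.h>

static int sketch_map_one_page(unsigned long addr, phys_addr_t phys,
			       pgprot_t prot)
{
	pgd_t *pgd = pgd_offset_k(addr);
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	/*
	 * Each *_alloc() below takes init_mm.page_table_lock only if it
	 * has to install a missing intermediate table.
	 */
	p4d = p4d_alloc(&init_mm, pgd, addr);
	if (!p4d)
		return -ENOMEM;
	pud = pud_alloc(&init_mm, p4d, addr);
	if (!pud)
		return -ENOMEM;
	pmd = pmd_alloc(&init_mm, pud, addr);
	if (!pmd)
		return -ENOMEM;
	pte = pte_alloc_kernel(pmd, addr);
	if (!pte)
		return -ENOMEM;

	/* Leaf level update: init_mm.page_table_lock is not held here. */
	set_pte_at(&init_mm, addr, pte, pfn_pte(PHYS_PFN(phys), prot));
	return 0;
}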

>
> I believe that the idea is that operations on separate VMAs should never

I guess by 'VMA' you mean a kernel virtual range, not an actual VMA
(vm_area_struct), which applies only to user space, not the kernel.

> conflict at the leaf level, and operations on the same VMA should be
> serialised somehow w.r.t. that VMA.

AFAICT there is nothing other than the hotplug lock, i.e. mem_hotplug_lock,
which prevents concurrent init_mm modifications, and the current situation is
only safe because these VA areas happen not to overlap with respect to
intermediate page table level spans.

>
> AFAICT, these functions are _never_ called on the linear/direct map or
> vmemmap VA ranges, and whether or not these can conflict with hot-remove
> is entirely dependent on whether those ranges can share a level of table
> with the vmalloc region.

Right, but all these VA ranges (linear, vmemmap, vmalloc) are wired into
init_mm, hence I wonder whether it is prudent to assume a layout scheme, which
varies a lot across architectures, while deciding on possible race protections.
I also wonder why these users should not call [get|put]_online_mems() to
prevent a race with hotplug. Will try this out.

Unless generic MM expects the platform to lay out these VA ranges (linear,
vmemmap, vmalloc) in such a manner that non-overlap at intermediate level page
table spans is guaranteed. Only then would we not need a lock.
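
Something along these lines is what I have in mind, e.g. for the ioremap case
(just a hedged sketch; guarded_ioremap_page_range() is a hypothetical wrapper,
the actual change would rather go inside the callers themselves):

#include <linux/io.h>
#include <linux/memory_hotplug.h>

int guarded_ioremap_page_range(unsigned long addr, unsigned long end,
			       phys_addr_t phys, pgprot_t prot)
{
	int ret;

	/*
	 * Pin hotplug state so a concurrent hot remove cannot free a
	 * shared intermediate table while we walk/populate it.
	 */
	get_online_mems();
	ret = ioremap_page_range(addr, end, phys, prot);
	put_online_mems();

	return ret;
}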

>
> Do you know how likely that is to occur? e.g. what proportion of the

TBH I don't know.

> vmalloc region may share a level of table with the linear or vmemmap
> regions in a typical arm64 or x86 configuration? Can we deliberately
> provoke this failure case?

I have not enumerated those yet, but there are multiple configs on arm64 (and
probably on x86) which decide the kernel VA space layout and hence create
these potential races. Regardless, it is not right to assume anything about
the vmalloc range span and skip the lock.

Not sure how to provoke this failure case from user space with simple hotplug
because vmalloc physical allocation normally cannot be controlled without a
hacked kernel change.

>
> [...]
>
>> In all of the above.
>>
>> - Page table pages [p4d|pud|pmd|pte]_alloc_[kernel] settings are
>> protected with init_mm.page_table_lock
>
> Racy addition is protected in this manner.

Right.

>
>> - Should it not require init_mm.page_table_lock for all leaf level
>> (PUD|PMD|PTE) modifications as well ?
>
> As above, I believe that the PTL is assumed to not be necessary there
> since other mutual exclusion should be in effect to prevent racy
> modification of leaf entries.

Wondering what those mutual exclusions are, other than the memory hotplug
lock. Again, if it rests on kernel VA space layout assumptions, it is not a
good idea.

>
>> - Should it not require init_mm.page_table_lock for the page table walk
>> itself ?
>>
>> Not taking an overall lock for all these three operations will
>> potentially race with an ongoing memory hot remove operation which
>> takes an overall lock as proposed. Wondering if this has been
>> safe till now ?
>
> I suspect that the answer is that hot-remove is not thoroughly
> stress-tested today, and conflicts are possible but rare.

Will make these generic modifiers call [get|put]_online_mems() in a separate
patch, at least to protect them from a concurrent memory hot remove operation.

>
> As above, can we figure out how likely conflicts are, and try to come up
> with a stress test?

Will try something out by hot plugging a memory range without actually
onlining it, while a vmalloc stress test runs on the system.
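
Something like this hypothetical module (just a rough sketch; vmalloc_stress
is a made-up name) could keep the vmalloc region populating and tearing down
kernel page tables while hot add/remove of a nearby memory range is driven in
parallel:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/vmalloc.h>
#include <linux/sizes.h>
#include <linux/string.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *stress_task;

static int vmalloc_stress_fn(void *unused)
{
	while (!kthread_should_stop()) {
		/* Populates (and later tears down) vmalloc page tables. */
		void *p = vmalloc(SZ_2M);

		if (p) {
			memset(p, 0, SZ_2M);
			vfree(p);
		}
		cond_resched();
	}
	return 0;
}

static int __init vmalloc_stress_init(void)
{
	stress_task = kthread_run(vmalloc_stress_fn, NULL, "vmalloc_stress");
	return PTR_ERR_OR_ZERO(stress_task);
}

static void __exit vmalloc_stress_exit(void)
{
	kthread_stop(stress_task);
}

module_init(vmalloc_stress_init);
module_exit(vmalloc_stress_exit);
MODULE_LICENSE("GPL");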

>
> Is it possible to avoid these specific conflicts (ignoring ptdump) by
> aligning VA regions such that they cannot share intermediate levels of
> table?

The kernel VA space layout is platform specific, and core MM does not mandate
much about it. Hence the generic modifiers should not make any assumptions
regarding it but should protect themselves with locks. Doing anything other
than that just pushes the problem into the future.