Re: [v1 resend 08/12] mm/thp: add split during migration support

From: Balbir Singh
Date: Sun Jul 06 2025 - 22:30:25 EST


On 7/6/25 13:03, Zi Yan wrote:
> On 5 Jul 2025, at 22:34, Zi Yan wrote:
>
>> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>>
>>> On 7/6/25 11:34, Zi Yan wrote:
>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>
>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>
>>>>>>>> s/pages/folio
>>>>>>>>
>>>>>>>
>>>>>>> Thanks, will make the changes
>>>>>>>
>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>
>>>>>>>
>>>>>>> Ack, will change the name
>>>>>>>
>>>>>>>
>>>>>>>>> *
>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>> */
>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>> {
>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>>> * operations.
>>>>>>>>> */
>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>>> - if (!anon_vma) {
>>>>>>>>> - ret = -EBUSY;
>>>>>>>>> - goto out;
>>>>>>>>> + if (!isolated) {
>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>>> + if (!anon_vma) {
>>>>>>>>> + ret = -EBUSY;
>>>>>>>>> + goto out;
>>>>>>>>> + }
>>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>>> }
>>>>>>>>> end = -1;
>>>>>>>>> mapping = NULL;
>>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>>> } else {
>>>>>>>>> unsigned int min_order;
>>>>>>>>> gfp_t gfp;
>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>> goto out_unlock;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> - unmap_folio(folio);
>>>>>>>>> + if (!isolated)
>>>>>>>>> + unmap_folio(folio);
>>>>>>>>>
>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>> local_irq_disable();
>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>
>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>> - uniform_split);
>>>>>>>>> + uniform_split, isolated);
>>>>>>>>> } else {
>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>> fail:
>>>>>>>>> if (mapping)
>>>>>>>>> xas_unlock(&xas);
>>>>>>>>> local_irq_enable();
>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>> + if (!isolated)
>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>> ret = -EAGAIN;
>>>>>>>>> }
>>>>>>>>
>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> There are two reasons for going down the current code path
>>>>>>
>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>
>>>>>
>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>> if calling the API with unmapped
>>>>
>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>> Your patch only adds support for device private page split, not any unmapped
>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>
>>>
>>> There is a use for splitting unmapped folios (see below)
>>>
>>>>>
>>>>>> You should teach different parts of folio split code path to handle
>>>>>> device private folios properly. Details are below.
>>>>>>
>>>>>>>
>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>> the split routine to return with -EBUSY
>>>>>>
>>>>>> You do something below instead.
>>>>>>
>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>> ret = -EBUSY;
>>>>>> goto out;
>>>>>> } else if (anon_vma) {
>>>>>> anon_vma_lock_write(anon_vma);
>>>>>> }
>>>>>>
>>>>>
>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>> the check for device private folios?
>>>>
>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>> in if (!isolated) branch. In that case, just do
>>>>
>>>> if (folio_is_device_private(folio) {
>>>> ...
>>>> } else if (is_anon) {
>>>> ...
>>>> } else {
>>>> ...
>>>> }
>>>>
>>>>>
>>>>>> People can know device private folio split needs a special handling.
>>>>>>
>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>> sees it as both device private and file-backed?
>>>>>>
>>>>>
>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>> the name device private.
>>>>
>>>> OK.
>>>>
>>>>>
>>>>>>
>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>>> entries under consideration are already migration entries in this case.
>>>>>>> This is wasteful and in some case unexpected.
>>>>>>
>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>> sees a device private folio.
>>>>>>
>>>>>> For remap_page(), you can simply return for device private folios
>>>>>> like it is currently doing for non anonymous folios.
>>>>>>
>>>>>
>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>> remap_folio(), because
>>>>>
>>>>> 1. We need to do a page table walk/rmap walk again
>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>> in the rmap handling (set/remove migration ptes)
>>>>> 3. In this context, the code is already in the middle of migration,
>>>>> so trying to do that again does not make sense.
>>>>
>>>> Why doing split in the middle of migration? Existing split code
>>>> assumes to-be-split folios are mapped.
>>>>
>>>> What prevents doing split before migration?
>>>>
>>>
>>> The code does do a split prior to migration if THP selection fails
>>>
>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@xxxxxxxxxx/
>>> and the fallback part which calls split_folio()
>>
>> So this split is done when the folio in system memory is mapped.
>>
>>>
>>> But the case under consideration is special since the device needs to allocate
>>> corresponding pfn's as well. The changelog mentions it:
>>>
>>> "The common case that arises is that after setup, during migrate
>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>> pages."
>>>
>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>
>>> 1. migrate_vma_setup()
>>> 2. migrate_vma_pages()
>>> 3. migrate_vma_finalize()
>>>
>>> It can so happen that when we get the destination pfn's allocated the destination
>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>
>>> The pages have been unmapped and collected in migrate_vma_setup()
>>
>> So these unmapped folios are system memory folios? I thought they are
>> large device private folios.
>>
>> OK. It sounds like splitting unmapped folios is really needed. I think
>> it is better to make a new split_unmapped_folio() function
>> by reusing __split_unmapped_folio(), since __folio_split() assumes
>> the input folio is mapped.
>
> And to make __split_unmapped_folio()'s functionality match its name,
> I will later refactor it. At least move local_irq_enable(), remap_page(),
> and folio_unlocks out of it. I will think about how to deal with
> lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
> parameter from __split_unmapped_folio().
>

That sounds like a plan, it seems like there needs to be a finish phase of
the split and it does not belong to __split_unmapped_folio(). I would propose
that we rename "isolated" to "folio_is_migrating" and then your cleanups can
follow? Once your cleanups come in, we won't need to pass the parameter to
__split_unmapped_folio().

Balbir Singh