Re: [PATCH] mm/memory_hotplug: return zero from do_migrate_range() for only success

From: SeongJae Park
Date: Wed Feb 15 2023 - 17:34:02 EST


On Wed, 15 Feb 2023 21:00:50 +0100 David Hildenbrand <david@xxxxxxxxxx> wrote:

> On 15.02.23 19:03, SeongJae Park wrote:
> > On Wed, 15 Feb 2023 14:16:05 +0100 David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> >> On 14.02.23 23:32, SeongJae Park wrote:
> >>> do_migrate_range() usually returns the migrate_pages() return value,
> >>> where zero means perfect success. If all pages fail to be isolated,
> >>> however, it returns the isolate_{lru,movable}_page() return value, or
> >>> zero if all pfns were invalid, hugetlb, or hwpoisoned. So
> >>> do_migrate_range() returning zero means either perfect success or one
> >>> of these special cases of total isolation failure.
> >>>
> >>> Actually, the return value is not checked by any caller, so it might be
> >>> better to simply make it a void function. However, there is a TODO for
> >>> checking the return value.
> >>
> >> I'd prefer to not add more dead code ;) Let's not return an error instead.
> >
> > Makes sense, I will send the next spin soon.
> >
> >>
> >> It's still unclear which kind of fatal migration issues we actually care
> >> about and how to really detect them.
> >
> > What do you think about treating the isolation/migration rate limit
> > (migrate_rs) hit in do_migrate_range() as fatal? It warns for the event
> > already, so definitely a bad sign.
> >
> > If that's not bad enough to be treated as fatal, I think we could add yet
> > another rate limit whose hit is considered fatal.
>
> IIRC, there are some setups where offlining might take several minutes
> (e.g., heavy O_DIRECT load) and that's to be expected.
>
> So the existing code warns for better debugging, but keeps trying. So
> the ratelimit is rather to not produce too much debug output, not to
> really indicate that something is fatal.

Thank you for the clarification, David!
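
A minimal userspace sketch of the point above, namely that the rate limit
only throttles the warning output while the retry loop keeps going. This is
not the kernel code: warn_limit, warn_limit_hit() and offline_with_retries()
are hypothetical names, the limiter counts failures rather than using time
like the kernel's ratelimit does, and "the range becomes migratable after
fail_count failed attempts" is an assumption made only for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter-based limiter; NOT the kernel's ratelimit API. */
struct warn_limit {
	unsigned int interval;	/* warn once per this many failures (nonzero) */
	unsigned int failures;	/* failures recorded so far */
	unsigned int warnings;	/* warnings actually emitted */
};

/* Record one failure; return true if this failure should be reported. */
static bool warn_limit_hit(struct warn_limit *wl)
{
	bool emit = (wl->failures % wl->interval) == 0;

	wl->failures++;
	if (emit)
		wl->warnings++;
	return emit;
}

/*
 * Keep retrying a simulated migration that fails fail_count times and
 * then succeeds. A failure is never treated as fatal; the limiter only
 * decides whether to print. Returns the total number of attempts.
 */
static unsigned int offline_with_retries(struct warn_limit *wl,
					 unsigned int fail_count)
{
	unsigned int attempts = 0;

	for (;;) {
		attempts++;
		if (attempts > fail_count)
			return attempts;	/* simulated success */
		if (warn_limit_hit(wl))
			fprintf(stderr, "attempt %u failed, retrying\n",
				attempts);
	}
}
```

With interval set to 10 and fail_count set to 25, the loop still runs all 26
attempts, but only the 1st, 11th and 21st failures are reported.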


Thanks,
SJ

>
> --
> Thanks,
>
> David / dhildenb