Re: cma_alloc(), add sleep-and-retry for temporary page pinning

From: Andrew Morton
Date: Fri Aug 21 2020 - 18:01:46 EST


On Tue, 11 Aug 2020 15:20:47 -0700 cgoldswo@xxxxxxxxxxxxxx wrote:

> On 2020-08-06 18:31, Andrew Morton wrote:
> > On Wed, 5 Aug 2020 19:56:21 -0700 Chris Goldsworthy
> > <cgoldswo@xxxxxxxxxxxxxx> wrote:
> >
> >> On mobile devices, failure to allocate from a CMA area constitutes a
> >> functional failure. Sometimes during CMA allocations, we have observed
> >> that pages in a CMA area allocated through alloc_pages(), that we're
> >> trying to migrate away to make room for a CMA allocation, are
> >> temporarily pinned. This temporary pinning can occur when a process
> >> that owns the pinned page is being forked (the example is explained
> >> further in the commit text). This patch addresses this issue by adding
> >> a sleep-and-retry loop in cma_alloc(). There's another example we know
> >> of similar to the above that occurs during exit_mmap() (in
> >> zap_pte_range() specifically), but I need to determine if this is
> >> still relevant today.
> >
>
> > Sounds fairly serious but boy, we're late for 5.9.
> >
> > I can queue it for 5.10 with a cc:stable so that it gets backported
> > into earlier kernels a couple of months from now, if we think the
> > seriousness justifies backporting(?).
> >
>
> Queuing this seems like the best way to proceed.

I'd really prefer not. It's very hacky and it isn't a fix - it's a
likelihood-reducer.

> >
> > And... it really is a sad little patch, isn't it? Instead of fixing
> > the problem, it reduces the problem's probability by 5x. Can't we do
> > better than this?
>
> I have one alternative in mind. I have been able to review the
> exit_mmap() case, so before proceeding, let's do a breakdown of the
> problem: we can categorize the pinning issue we're trying to address
> here as being one of (1) incrementing _refcount and getting
> context-switched out before incrementing _mapcount (applies to forking
> a process / copy_one_pte()), and (2) decrementing _mapcount and getting
> context-switched out before decrementing _refcount (applies to tearing
> down a process / exit_mmap()). So, one alternative would be to insert
> preempt_disable/enable() calls at the affected sites. For the
> copy_one_pte() pinning case, we could do the following inside of
> copy_one_pte():
>
>  	if (page) {
> +		preempt_disable();
>  		get_page(page);
>  		page_dup_rmap(page, false);
> +		preempt_enable();
>  		rss[mm_counter(page)]++;
>  	}

This would make the retry loop much more reliable, and
preempt_disable() is fast. Such additions should be clearly commented
(a bare preempt_disable() explains nothing) - perhaps by wrapping the
preempt_disable() in a suitably-named wrapper function.

But it's still really unpleasant.

>
> The good thing about this patch is that it has been stable in our kernel
> for four years (though for some SoCs we increased the retry counts).

That's discouraging!

> One thing to stress is that there are other instances of CMA page
> pinning that this patch isn't attempting to address.

Oh. How severe are these?

> Please let me know if you're okay with queuing this for the 5.10 merge
> window - if you are, I can add an option to configure the number of
> retries, and will resend the patch once the 5.9 merge window closes.

Well. Why not wait infinitely? Because there are other sources of CMA
page pinning, I guess.


Could we take a sleeping lock on the exit_mmap() path and on the
migration path, so that the migration path automatically waits for
exactly as long as it needs to?