Re: [PATCH 1/2] mm: disable LRU pagevec during the migration temporarily

From: Minchan Kim
Date: Wed Mar 03 2021 - 19:03:50 EST


On Wed, Mar 03, 2021 at 01:49:36PM +0100, Michal Hocko wrote:
> On Tue 02-03-21 13:09:48, Minchan Kim wrote:
> > LRU pagevec holds a refcount on pages until the pagevec is drained.
> > That can prevent migration, since the refcount of the page is greater
> > than what the migration logic expects. To mitigate the issue,
> > callers of migrate_pages drain the LRU pagevec via migrate_prep or
> > lru_add_drain_all before the migrate_pages call.
> >
> > However, that is not enough, because pages entering the pagevec after
> > the draining call can still stay in the pagevec and keep preventing
> > page migration. Since some callers of migrate_pages have retry logic
> > with LRU draining, the page would migrate on the next trial. It is
> > still fragile, though, because it doesn't close the fundamental race
> > between pages newly entering the pagevec and migration, so a migration
> > failure can ultimately cause a contiguous memory allocation failure.
> >
> > To close the race, this patch disables LRU caches (i.e., pagevec)
> > during ongoing migration until the migration is done.
> >
> > Since it's really hard to reproduce, I measured how many times
> > migrate_pages retried in force mode, using the debug code below.
> >
> > int migrate_pages(struct list_head *from, new_page_t get_new_page,
> > ..
> > ..
> >
> > 	if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
> > 		printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
> > 		dump_page(page, "fail to migrate");
> > 	}
> >
> > The test repeatedly launched Android apps with CMA allocation running
> > in the background every five seconds. The total CMA allocation count
> > was about 500 during the testing. With this patch, the dump_page count
> > was reduced from 400 to 30.
>
> Have you seen any improvement on the CMA allocation success rate?

Unfortunately, the CMA allocation failure rate is really hard to
reproduce with a reasonable margin of error under a real workload.
That's why I measured this soft metric instead of direct CMA failures
under a real workload (I don't want to build some ad hoc artificial
benchmark and keep tuning system knobs until it shows an extremely
exaggerated result to convince people of the patch's effect).

Please say so if you believe this work is pointless unless there is
stable data from a reproducible scenario. I am happy to drop it.

>
> > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> > * from RFC - http://lore.kernel.org/linux-mm/20210216170348.1513483-1-minchan@xxxxxxxxxx
> > * use atomic and lru_add_drain_all for strict ordering - mhocko
> > * lru_cache_disable/enable - mhocko
> >
> >  fs/block_dev.c          |  2 +-
> >  include/linux/migrate.h |  6 +++--
> >  include/linux/swap.h    |  4 ++-
> >  mm/compaction.c         |  4 +--
> >  mm/fadvise.c            |  2 +-
> >  mm/gup.c                |  2 +-
> >  mm/khugepaged.c         |  2 +-
> >  mm/ksm.c                |  2 +-
> >  mm/memcontrol.c         |  4 +--
> >  mm/memfd.c              |  2 +-
> >  mm/memory-failure.c     |  2 +-
> >  mm/memory_hotplug.c     |  2 +-
> >  mm/mempolicy.c          |  6 +++++
> >  mm/migrate.c            | 15 ++++++-----
> >  mm/page_alloc.c         |  5 +++-
> >  mm/swap.c               | 55 +++++++++++++++++++++++++++++++++++------
> >  16 files changed, 85 insertions(+), 30 deletions(-)
>
> The churn seems quite big for something that should have been a
> very small change. Have you considered not changing lru_add_drain_all
> and instead introducing __lru_add_drain_all, which would implement the
> enforced flushing?

Good idea.
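
To make sure I understand the suggestion, here is a minimal user-space
sketch of that split. It is only a model under stated assumptions: the
`force` parameter, MODEL_NR_CPUS, pagevec_count[] and drains_scheduled
are hypothetical names for illustration, not the actual mm/swap.c code.
The point is that existing callers keep calling lru_add_drain_all()
unchanged, and only the new disable path would ask for the enforced
flush.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for the model */

/* Hypothetical model state: pages cached per CPU, and drains issued. */
static int pagevec_count[MODEL_NR_CPUS];
static int drains_scheduled;

/*
 * Enforced variant (name taken from the review suggestion; the
 * "force" parameter is an assumption). With force set, every CPU is
 * drained even if its pagevec looks empty.
 */
static void __lru_add_drain_all(bool force)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		assert(pagevec_count[cpu] >= 0);
		if (force || pagevec_count[cpu] > 0) {
			pagevec_count[cpu] = 0;
			drains_scheduled++;
		}
	}
}

/* Existing entry point: same signature, so callers need no changes. */
void lru_add_drain_all(void)
{
	__lru_add_drain_all(false);
}
```

With a split like this, the diffstat should shrink to the few files
that actually need the enforced behaviour.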

>
> [...]
> > +static atomic_t lru_disable_count = ATOMIC_INIT(0);
> > +
> > +bool lru_cache_disabled(void)
> > +{
> > +	return atomic_read(&lru_disable_count);
> > +}
> > +
> > +void lru_cache_disable(void)
> > +{
> > +	/*
> > +	 * lru_add_drain_all's IPI will make sure no new pages are added
> > +	 * to the pcp lists and drain them all.
> > +	 */
> > +	atomic_inc(&lru_disable_count);
>
> As already mentioned in the last review, the IPI reference is more
> cryptic than useful. I would go with something like this instead:
>
> /*
>  * lru_add_drain_all in the force mode will schedule draining on
>  * all online CPUs, so any calls of lru_cache_disabled wrapped by
>  * local_lock or preemption disabled would be ordered by that.
>  * The atomic operation doesn't need to have stronger ordering
>  * requirements because that is enforced by the scheduling
>  * guarantees.
>  */

Thanks for the nice description.
I will use it in the next revision if you believe this work is useful.
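
For reference, the intended behaviour of the disable counter can be
modelled in user space roughly as below. This is a single-threaded
sketch with C11 atomics, so it does not demonstrate the cross-CPU
ordering that comment describes; it only shows that once the count is
non-zero, newly added pages bypass the cache. MODEL_PAGEVEC_SIZE,
pagevec_nr, lru_nr and the body of lru_cache_add() here are
hypothetical stand-ins, not the code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PAGEVEC_SIZE 15	/* hypothetical batch size */

/* Hypothetical model state: one "pagevec" and the pages on the LRU. */
static int pagevec_nr;
static int lru_nr;
static atomic_int lru_disable_count;

bool lru_cache_disabled(void)
{
	return atomic_load(&lru_disable_count) != 0;
}

/* Move everything cached in the pagevec onto the LRU. */
static void lru_add_drain(void)
{
	lru_nr += pagevec_nr;
	pagevec_nr = 0;
}

/* Batch pages in the pagevec unless caching is currently disabled. */
void lru_cache_add(void)
{
	pagevec_nr++;
	if (pagevec_nr == MODEL_PAGEVEC_SIZE || lru_cache_disabled())
		lru_add_drain();
}

void lru_cache_disable(void)
{
	atomic_fetch_add(&lru_disable_count, 1);
	/* Stands in for the forced drain on every CPU in the kernel. */
	lru_add_drain();
}

void lru_cache_enable(void)
{
	int old = atomic_fetch_sub(&lru_disable_count, 1);

	/* An enable must always pair with an earlier disable. */
	assert(old > 0);
	(void)old;
}
```

Using a counter rather than a flag lets disable/enable pairs from
concurrent migrations nest without one caller re-enabling the cache
under another.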