Re: [PATCH linux-next] mm: swap: get rid of deadloop in swapin readahead

From: Andrew Morton
Date: Tue Mar 01 2022 - 19:33:04 EST


On Mon, 28 Feb 2022 20:07:33 -0800 (PST) Hugh Dickins <hughd@xxxxxxxxxx> wrote:

> > > --- a/mm/swap_state.c
> > > +++ b/mm/swap_state.c
> > > @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> > > * in swap_map, but not yet added its page to swap cache.
> > > */
> > > - cond_resched();
> > > + schedule_timeout_uninterruptible(1);
> > > }
> > >
> > > /*
> >
> > Sigh. I guess yes, we should do this, at least in a short-term,
> > backportable-to-stable way.
> >
> > But busy-waiting while hoping that someone else will save us isn't an
> > attractive design. Hugh, have you ever thought about something more
> > deterministic in there?
>
> Not something more deterministic, no: I think that would entail
> heavier locking, perhaps slowing down hotter paths, just to avoid
> this swap oddity.
>
> This loop was written long before there was a preemptive kernel:
> it was appropriate then, and almost never needed more than one retry
> to complete; but preemption changed the story without us realizing.
>
> Sigh here too. I commend the thread on it from July 2018:
> https://lore.kernel.org/linux-mm/2018072514403228778860@xxxxxxxxxxxx/
>
> There the 4.9-stable user proposed preempt_disable(), I agreed but
> found the patch provided insufficient, and offered another 4.9 patch
> further down the thread. Your preference at the time was msleep(1).
>
> I was working on a similar patch for 4.18, but have not completed it
> yet ;) and don't remember how satisfied or not I was with that one;
> and wonder if I'm any more likely to get it finished by 2026. It's
> clear that I put much more thought into it back then than just now.
>
> Maybe someone else would have a go: my 4.9 patch in that thread
> shows most of it, but might need a lot of work to update to 5.17.
>
> And it also gathered some temporary debug stats on how often this
> happens: I'm not conscious of using RT at all, but was disturbed to see
> how long an ordinary preemptive kernel was sometimes spinning there.
> So I think I agree with you more than Michal on that: RT just makes
> the bad behaviour more obvious.

Thanks as always.

Using msleep() seems pretty pointless so I plan to go ahead with patch
as-is, with a cc:stable. None of it is pretty, but it's better than
what we have now, yes?