Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()

From: NeilBrown
Date: Mon Sep 13 2021 - 23:28:05 EST


On Tue, 14 Sep 2021, Dave Chinner wrote:
> On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > Documentation comment in gfp.h discourages indefinite retry loops on
> > ENOMEM and says of __GFP_NOFAIL that it
> >
> > is definitely preferable to use the flag rather than opencode
> > endless loop around allocator.
> >
> > So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> > not given.
> >
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > ---
> > fs/xfs/kmem.c | 16 ++++------------
> > 1 file changed, 4 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 6f49bf39183c..f545f3633f88 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> >  	int retries = 0;
> >  	gfp_t lflags = kmem_flags_convert(flags);
> > -	void *ptr;
> >
> >  	trace_kmem_alloc(size, flags, _RET_IP_);
> >
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & KM_MAYFAIL))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> > -				current->comm, current->pid,
> > -				(unsigned int)size, __func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & KM_MAYFAIL))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
>
> Which means we no longer get warnings about memory allocation
> failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
> in this loop. Hence we'll now get silent deadlocks through this code
> instead of getting warnings that memory allocation is failing
> repeatedly.

Yes, that is a problem. Could we just clear __GFP_NOWARN when setting
__GFP_NOFAIL?
Or is the 1-in-100 important? I think the default warning is one every
10 seconds.
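
To make the first option concrete, here is a minimal userspace sketch of
the flag handling (untested against a kernel tree). The MODEL_* constants
and model_kmem_gfp() are made-up names with made-up bit values, used only
for illustration; they are not the kernel's definitions.

```c
/*
 * Hypothetical bit values for illustration only. The real KM_* and
 * __GFP_* definitions live in the kernel headers and differ from these.
 */
#define MODEL_KM_MAYFAIL  0x0008u
#define MODEL_GFP_NOWARN  0x0100u
#define MODEL_GFP_NOFAIL  0x0200u

/*
 * Model of the proposal: when the caller did not pass the "may fail"
 * flag, request a no-fail allocation and drop the no-warn bit so the
 * allocator's own warnings are not suppressed.
 */
static unsigned int model_kmem_gfp(unsigned int km_flags, unsigned int lflags)
{
	if (!(km_flags & MODEL_KM_MAYFAIL)) {
		lflags |= MODEL_GFP_NOFAIL;
		lflags &= ~MODEL_GFP_NOWARN;
	}
	return lflags;
}
```

With these model values, a must-succeed caller that starts with only the
no-warn bit ends up with only the no-fail bit, while a caller that passed
the may-fail flag keeps its flags unchanged.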

>
> I also wonder how changing the backoff behaviour here (it's a 20ms
> wait right now because there are no early wakeups) will affect the
> behaviour, as __GFP_NOFAIL won't wait for that extra time between
> allocation attempts....

The internal backoff is 100ms if there is much pending writeout, and
there are 16 internal retries. If there is not much pending writeout, I
think it just loops with cond_resched().
So adding 20ms can only be at all interesting when the only way to
reclaim memory is something other than writeout. I don't know how to
think about that.
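
For a rough sense of scale, here is a back-of-envelope sketch using only
the figures mentioned in this thread (20ms per retry and a warning every
100 retries in the old loop; 100ms backoff and 16 retries inside the
allocator). The names are made up for illustration, and the numbers have
not been checked against the allocator source.

```c
/* Figures as quoted in this thread; treat them as assumptions. */
enum {
	OLD_WAIT_MS          = 20,  /* congestion_wait(BLK_RW_ASYNC, HZ/50) */
	OLD_RETRIES_PER_WARN = 100, /* xfs_err() fired every 100th retry */
	INTERNAL_BACKOFF_MS  = 100, /* backoff when much writeout is pending */
	INTERNAL_RETRIES     = 16,  /* internal retry count */
};

/* Minimum time spent waiting between two warnings in the old loop. */
static int old_loop_ms_between_warnings(void)
{
	return OLD_WAIT_MS * OLD_RETRIES_PER_WARN;
}

/* Total internal backoff if every internal retry waits the full 100ms. */
static int internal_backoff_total_ms(void)
{
	return INTERNAL_BACKOFF_MS * INTERNAL_RETRIES;
}
```

Under those assumptions the old loop warned after at least 2000ms of
waiting, and the internal backoff adds up to about 1600ms when writeout
is heavy, so the extra 20ms per pass is small next to it in that case.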

>
> And, of course, how did you test this? Sometimes we see
> unpredicted behaviours as a result of "simple" changes like this
> under low memory conditions...

I suspect this is close to untestable. While I accept that there might
be a scenario where the change might cause some macro effect, it would
most likely be some interplay with some other subsystem struggling with
memory. Testing XFS by itself would be unlikely to find it.

Thanks,
NeilBrown


>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>
>