Re: rsync: page allocation stalls in kernel 4.9.10 to a VessRAID NAS

From: Michal Hocko
Date: Tue Feb 28 2017 - 11:57:23 EST


On Tue 28-02-17 11:19:33, Robert Kudyba wrote:
>
> > On Feb 28, 2017, at 10:15 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Tue 28-02-17 09:59:35, Robert Kudyba wrote:
> >>
> >>> On Feb 28, 2017, at 9:40 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >>>
> >>> On Tue 28-02-17 09:33:49, Robert Kudyba wrote:
> >>>>
> >>>>> On Feb 28, 2017, at 9:15 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >>>>> and this one is hitting the min watermark while there is not really
> >>>>> much to reclaim. Only the page cache which might be pinned and not
> >>>>> reclaimable from this context because this is GFP_NOFS request. It is
> >>>>> not all that surprising the reclaim context fights to get some memory.
> >>>>> There is a huge amount of the reclaimable slab which probably just makes
> >>>>> a slow progress.
> >>>>>
> >>>>> That is not something completely surprsing on 32b system I am afraid.
> >>>>>
> >>>>> Btw. is the stall repeating with the increased time or it gets resolved
> >>>>> eventually?
> >>>>
> >>>> Yes and if you mean by repeating itâs not only affecting rsync but
> >>>> you can see just now automount and NetworkManager get these page
> >>>> allocation stalls and kswapd0 is getting heavy CPU load, are there any
> >>>> other settings I can adjust?
> >>>
> >>> None that I am aware of. You might want to talk to FS guys, maybe they
> >>> can figure out who is pinning file pages so that they cannot be
> >>> reclaimed. They do not seem to be dirty or under writeback. It would be
> >>> also interesting to see whether that is a regression. The warning is
> >>> relatively new so you might have had this problem before just haven't
> >>> noticed it.
> >>
> >> We have been getting out of memory errors for a while but those seem
> >> to have gone away.
> >
> > this sounds suspicious. Are you really sure that this is a new problem?
> > Btw. is there any reason to use 32b kernel at all? It will always suffer
> > from a really small lowmemâ
>
> No this has been a problem for a while. Not sure if this server can
> handle 64b itâs a bit old.

Ok, this is unfortunate. There is usually not much interest to fixing
32b issues which are inherent to the used memory model and which are not
regressions which would be fixable, I am afraid.

> >> We did just replace the controller in the VessRAID
> >> as there were some timeouts observed and multiple login/logout
> >> attempts.
> >>
> >> By FS guys do you mean the linux-fsdevel or linux-fsf list?
> >
> > yeah linux-fsdevel. No idea what linux-fsf is. It would be great if you
> > could collect some tracepoints before reporting the issue. At least
> > those in events/vmscan/*.
>
> Will do hereâs a perf report:

this will not tell us much. Tracepoints have much better chance to tell
us how reclaim is progressing.
--
Michal Hocko
SUSE Labs