Re: [mm] 4e2c82a409: ltp.overcommit_memory01.fail

From: Michal Hocko
Date: Tue Jul 07 2020 - 08:06:27 EST


On Tue 07-07-20 07:43:48, Qian Cai wrote:
>
>
> > On Jul 7, 2020, at 6:28 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > Would you have any examples? Because I find this highly unlikely.
> > OVERCOMMIT_NEVER only works when virtual memory is not largely
> > overcommitted with respect to the real memory demand, and that tends
> > to be the exception rather than the rule. "Modern" userspace (whatever
> > that means) tends to be really hungry for virtual memory which is then
> > used only very sparsely.
> >
> > I would argue that either somebody is running an "OVERCOMMIT_NEVER"
> > friendly SW and this is a permanent setting or this is not used at all.
> > At least this is my experience.
> >
> > So I strongly suspect that this LTP test failure is not something we
> > really lose sleep over. It would be nice to find a way to flush existing
> > batches but I would rather see a real workload that would suffer from
> > this imprecision.
>
> I have heard you say many times that you really don't care about those
> use cases unless you hear exactly how people are using them in your
> world.
>
> For example, last time you said the LTP oom tests are totally
> artificial and how little you care if they fail, while I could only
> enjoy how efficiently they find many issues like race conditions and
> bad error-accumulation handling that your "real world use cases" would
> take ages to flag, or never would.

Yes, they are effective at hitting corner cases and that is fine. I am
not dismissing their usefulness. I have tried to explain this many times
but let me try again. Seeing a corner case and thinking about a
potential fix is one thing. Treating such a failure as a hard regression
and asking for an otherwise useful functionality/improvement to be
reverted without a proper cost/benefit analysis is quite another. Sure,
having corner cases is not nice, but really, look at this example again.
The overcommit setting is a global one and it is hard to change it at
runtime willy-nilly, because that might have really detrimental side
effects on all running workloads. So it is quite reasonable to expect
that the mode is changed either early after boot or when the system is
in a quiescent state, with almost nothing but very core services
running, so the likelihood that the mode changes under a real load is
low.

> There are just too many valid use cases in this wild world. The
> difference is that I admit that I donât know or even aware all the
> use cases, and I donât believe you do as well.

Me neither, and I am not claiming that. All I am saying is that the
real risk of a regression is low enough that I wouldn't lose sleep over
it. It is perfectly fine to address this pro-actively if the fix is
reasonably maintainable. I was mostly reacting to your pushing for a
revert based solely on the LTP results.
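
For the record, flushing the existing batches when the mode flips to
OVERCOMMIT_NEVER could look roughly like this (untested sketch, not an
actual patch, and the helper name is made up). The idea is simply to
fold the per-CPU deltas back into the global counter so that the strict
accounting starts from an accurate value:

#include <linux/mman.h>
#include <linux/percpu_counter.h>

/*
 * Hypothetical helper: collapse the per-CPU deltas of vm_committed_as
 * into the global count.  Right after a mode switch the value seen by
 * the OVERCOMMIT_NEVER check can otherwise be off by up to
 * batch * num_online_cpus() pages.
 */
static void vm_committed_as_flush(void)
{
	s64 accurate = percpu_counter_sum(&vm_committed_as);

	/*
	 * Updates racing with us are fine; after the switch they are
	 * again bounded by the per-CPU batch.
	 */
	percpu_counter_set(&vm_committed_as, accurate);
}

The sysctl handler would only call this when the new policy is
OVERCOMMIT_NEVER, so the common modes keep the cheap batched updates.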

LTP is a very useful tool for raising awareness of potential problems,
but you shouldn't follow its results blindly.

> If a patchset breaks existing behavior that is written down exactly in
> the spec, then it is on someone to prove it is harmless. For example,
> if nobody is going to rely on something like this, now or in the
> future, then fix the spec and explain exactly why nobody should rely
> on it.

I am all for clarifications in the documentation.

--
Michal Hocko
SUSE Labs