Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy

From: Qian Cai
Date: Wed May 27 2020 - 11:42:58 EST


On Wed, May 27, 2020 at 09:33:32PM +0800, Feng Tang wrote:
> Hi Qian,
>
> On Wed, May 27, 2020 at 08:05:49AM -0400, Qian Cai wrote:
> > On Wed, May 27, 2020 at 06:46:06PM +0800, Feng Tang wrote:
> > > Hi Qian,
> > >
> > > On Tue, May 26, 2020 at 10:25:39PM -0400, Qian Cai wrote:
> > > > > > > > [1] https://lkml.org/lkml/2020/3/5/57
> > > > > > >
> > > > > > > > Reverting this series fixed a warning under memory pressure.
> > > > > >
> > > > > > Andrew, Stephen, can you drop this series?
> > > > > >
> > > > > > >
> > > > > > > [ 3319.257898] LTP: starting oom01
> > > > > > > [ 3319.284417] ------------[ cut here ]------------
> > > > > > > [ 3319.284439] memory commitment underflow
> > > > >
> > > > > Thanks for the catch!
> > > > >
> > > > > > Could you share some info about the platform, such as the number
> > > > > > of CPUs and the RAM size, and the mmap size your test program uses?
> > > > > > It would be great if you could point me to the test program.
> > > >
> > > > > I have reproduced this on both AMD and Intel. The test just
> > > > > allocates memory and swaps.
> > > >
> > > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
> > > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c
> > > >
> > > > > If none of the above triggers it for you, it might be better to run
> > > > > the whole LTP mm test suite, which has quite a few memory stressors.
> > > >
> > > > /opt/ltp/runltp -f mm
> > >
> > > > Thanks for sharing. I tried to reproduce this on 2 server platforms
> > > > but have not managed to so far; they are still under test.
> > >
> > > > Meanwhile, could you help to try the patch below, which is based on
> > > > Andi's suggestion and has some debug info? The warning is a little
> > > > strange, as the condition is
> > >
> > > (percpu_counter_read(&vm_committed_as) <
> > > -(s64)vm_committed_as_batch * num_online_cpus())
> > >
> > > > while for your platform (48 CPU + 128 GB RAM), the
> > > > '-(s64)vm_committed_as_batch * num_online_cpus()'
> > > > is an s64 value of '-32G', which makes the condition hard to satisfy.
> > > > When it is satisfied, it could be due to some s32/s64 conversion
> > > > quirk around the percpu-counter.
> >
> > > Below is the information for the AMD and powerpc systems affected by
> > > this. It may take a bit of patience to reproduce, but our usual daily
> > > CI triggers it eventually after a few tries.
> >
> > # git clone https://github.com/cailca/linux-mm.git
> > # cd linux-mm
> > # ./compile.sh
> > # systemctl reboot
> > # ./test.sh
>
> I just downloaded it, but it failed on my desktop machine during the
> 'yum' and 'grub2' setup. The difficulty for me in reproducing this is that
> the test platforms are behind the 0day framework, and I can hardly set up
> external test suites, though I have been trying all day today :)

I tried your debug patch, but it did not even compile on linux-next
(where the issue happened), and I am running out of time today. It
probably needs a large system to reproduce, as it did not happen on one
of our small s390 systems here.
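
For reference, the arithmetic behind the threshold quoted above can be
sketched in user space. This is only an illustration, not kernel code: the
helper names are hypothetical, the 4 KiB page size is an assumption, and so
is the batch formula (RAM pages / CPUs / 4), which is chosen here because it
lands near the '-32G' figure for a 48 CPU + 128 GB machine. The cast order
mirrors the quoted expression: the s32 batch is widened to s64 before it is
negated and multiplied.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: per-CPU batch size in pages.
 * ASSUMPTION: batch = ram_pages / ncpus / 4, clamped to INT32_MAX.
 * This formula is not taken from the thread.
 */
static int32_t assumed_batch(uint64_t ram_bytes, uint64_t page_size,
                             unsigned int ncpus)
{
    assert(page_size > 0 && ncpus > 0);   /* avoid division by zero */

    uint64_t ram_pages = ram_bytes / page_size;
    uint64_t batch = ram_pages / ncpus / 4;

    return batch > INT32_MAX ? INT32_MAX : (int32_t)batch;
}

/*
 * Same shape as the quoted expression:
 *     -(s64)vm_committed_as_batch * num_online_cpus()
 * The batch is widened to 64 bits first, so neither the negation nor the
 * multiplication is carried out in 32-bit arithmetic.
 */
static int64_t underflow_threshold(int32_t batch, unsigned int ncpus)
{
    return -(int64_t)batch * ncpus;
}

/* The warning condition: the counter reads below the threshold. */
static int commitment_underflow(int64_t counter, int64_t threshold)
{
    return counter < threshold;
}
```

Under these assumptions, 128 GiB of RAM with 4 KiB pages and 48 CPUs gives
a batch of 174762 pages and a threshold of -8388576 pages, which
corresponds to roughly 32 GiB of memory. The approximate counter would have
to read below that before the warning fires.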