Re: [PATCH] mapletree-vs-khugepaged

From: Liam Howlett
Date: Mon May 30 2022 - 13:39:02 EST


* Guenter Roeck <linux@xxxxxxxxxxxx> [220519 17:42]:
> On 5/19/22 07:35, Liam Howlett wrote:
> > * Guenter Roeck <linux@xxxxxxxxxxxx> [220517 10:32]:
> >
> > ...
> > >
> > > Another bisect result, boot failures with nommu targets (arm:mps2-an385,
> > > m68k:mcf5208evb). Bisect log is the same for both.
> > ...
> > > # first bad commit: [bd773a78705fb58eeadd80e5b31739df4c83c559] nommu: remove uses of VMA linked list
> >
> > I cannot reproduce this on my side, even with that specific commit. Can
> > you point me to the failure log, config file, etc? Do you still see
> > this with the fixes I've sent recently?
> >
>
> This was in linux-next; most recently with next-20220517.
> I don't know if that was up-to-date with your patches.
> The problem seems to be memory allocation failures.
> A sample log is at
> https://kerneltests.org/builders/qemu-m68k-next/builds/1065/steps/qemubuildcommand/logs/stdio
> The log history at
> https://kerneltests.org/builders/qemu-m68k-next?numbuilds=30
> will give you a variety of logs.
>
> The configuration is derived from m5208evb_defconfig, with initrd
> and command line embedded in the image. You can see the detailed
> configuration updates at
> https://github.com/groeck/linux-build-test/blob/master/rootfs/m68k/run-qemu-m68k.sh
>
> Qemu command line is
>
> qemu-system-m68k -M mcf5208evb -kernel vmlinux \
> -cpu m5208 -no-reboot -nographic -monitor none \
> -append "rdinit=/sbin/init console=ttyS0,115200"
>
> with initrd from
> https://github.com/groeck/linux-build-test/blob/master/rootfs/m68k/rootfs-5208.cpio.gz
>
> I use qemu v6.2, but any recent qemu version should work.

I have qemu 7.0, which seems to have changed the default memory size
from 32MB to 128MB. Your log shows the 32MB configuration:

Memory: 27928K/32768K available (2827K kernel code, 160K rwdata, 432K rodata, 1016K init, 66K bss, 4840K reserved, 0K cma-reserved)

The kernel boots with 128MB and with 64MB, but fails with an OOM at
32MB. Looking into it more, the OOM is caused by a failed contiguous
page allocation of 1MB (order 7 with 8K pages). This can be seen in the
log as well:

Running sysctl: echo: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL), nodemask=(null)
...
nommu: Allocation of length 884736 from process 63 (echo) failed

The last log message comes from the code path that uses
alloc_pages_exact().
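
As a quick sanity check of the numbers above, here is a minimal Python
sketch of the arithmetic. The helper name alloc_order() is hypothetical
(it is not a kernel function), and the 8K page size is the one assumed
in this mail. It only shows that an 884736-byte request rounds up to an
order-7 (1MB) block:

```python
PAGE_SIZE = 8 * 1024  # assumption: 8K pages, as stated in this mail


def alloc_order(nbytes, page_size=PAGE_SIZE):
    """Return the smallest order with (page_size << order) >= nbytes.

    Hypothetical helper for illustration only.
    """
    pages = -(-nbytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


length = 884736  # from the "nommu: Allocation of length ..." message
order = alloc_order(length)
print(order, (PAGE_SIZE << order) // 1024)  # order and block size in kB
```

884736 bytes is exactly 108 pages of 8K, so the next power of two is
128 pages, which is order 7.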

I don't see why my 256-byte nodes (one order-0 allocation yields 32
nodes) would fragment memory beyond use during boot. I checked for a
massive leak by adding a static node count to the code and have only
ever hit ~12 nodes. Consulting the OOM log from the above link again:

DMA: 0*8kB 1*16kB (U) 9*32kB (U) 7*64kB (U) 21*128kB (U) 7*256kB (U) 6*512kB (U) 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 8304kB

So to get to the point of breaking up a 1MB block, we'd need an obscene
number of nodes.
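
To put numbers on that, a rough Python sketch that parses the free-list
line quoted above. The function name parse_buddy() is made up for
illustration, and the 256-byte node size is the one mentioned in this
mail. It reports the total free memory, the largest block size that
still has free entries, and how many nodes a 1MB block would hold:

```python
import re

BUDDY_LINE = (
    "DMA: 0*8kB 1*16kB (U) 9*32kB (U) 7*64kB (U) 21*128kB (U) "
    "7*256kB (U) 6*512kB (U) 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 8304kB"
)
NODE_SIZE = 256  # bytes per node, as stated in this mail


def parse_buddy(line):
    """Return {block_size_kB: free_count} from a 'count*sizekB' line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


free = parse_buddy(BUDDY_LINE)
total_kb = sum(size * count for size, count in free.items())
largest_kb = max(size for size, count in free.items() if count)
nodes_per_mb = (1024 * 1024) // NODE_SIZE

print(total_kb, largest_kb, nodes_per_mb)
```

The per-order counts add up to the 8304kB total in the log, the largest
free block is 512kB, and a 1MB block corresponds to 4096 nodes, compared
with the ~12 nodes I have actually observed.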

Furthermore, the OOM on boot does not always happen, although it does
most of the time. When boot succeeds without an OOM, slabinfo shows
that maple_node has 32 active objects, which is one order-0 allocation.
Note that slabinfo counts active objects lazily, so the real number is
most likely lower than that.

Does anyone have any idea why nommu would be getting this fragmented?

Thanks,
Liam