Re: [PATCH] mapletree-vs-khugepaged

From: Liam Howlett
Date: Mon May 16 2022 - 10:03:03 EST


* Sven Schnelle <svens@xxxxxxxxxxxxx> [220515 16:02]:
> Liam Howlett <liam.howlett@xxxxxxxxxx> writes:
>
> > * Sven Schnelle <svens@xxxxxxxxxxxxx> [220513 10:46]:
> >> Starting today we're still seeing the same crash with linux-next from
> >> (next-20220513):
> >>
> >> [ 211.937897] CPU: 7 PID: 535 Comm: pt_upgrade Not tainted 5.18.0-rc6-11648-g76535d42eb53-dirty #732
> >> [ 211.937902] Unable to handle kernel pointer dereference in virtual kernel address space
> >> [ 211.937903] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> >> [ 211.937906] Failing address: 0e00000000000000 TEID: 0e00000000000803
> >> [ 211.937909] Krnl PSW : 0704c00180000000 0000001ca52f06d6
> >> [ 211.937910] Fault in home space mode while using kernel ASCE.
> >> [ 211.937917] AS:0000001ca6e24007 R3:0000001fffff0007 S:0000001ffffef800 P:000000000000003d
> >> [ 211.937914] (mmap_region+0x19e/0x848)
> >> [ 211.937929] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> >> [ 211.937939] Krnl GPRS: 0000000000000000 0e00000000000000 0000000000000000 0000000000000000
> >> [ 211.937942] ffffffff00000f0f ffffffffffffffff 0e00000000000000 0000040000001000
> >> [ 211.937945] 0000000083551900 0000040000000000 00000000000000fb 000003800070fc58
> >> [ 211.937947] 000000008f490000 0000000000000000 0000001ca52f06b6 000003800070fb48
> >> [ 211.937959] Krnl Code: 0000001ca52f06c6: a7740021 brc 7,0000001ca52f0708
> >> [ 211.937959] 0000001ca52f06ca: ec6801b3007c cgij %r6,0,8,0000001ca52f0a30
> >> [ 211.937959] #0000001ca52f06d0: e310f0f80004 lg %r1,248(%r15)
> >> [ 211.937959] >0000001ca52f06d6: e37010000020 cg %r7,0(%r1)
> >> [ 211.937959] 0000001ca52f06dc: a78400ea brc 8,0000001ca52f08b0
> >> [ 211.937959] 0000001ca52f06e0: e310f0f00004 lg %r1,240(%r15)
> >> [ 211.937959] 0000001ca52f06e6: ec180008007c cgij %r1,0,8,0000001ca52f06f6
> >> [ 211.937959] 0000001ca52f06ec: e39010080020 cg %r9,8(%r1)
> >> [ 211.937973] Call Trace:
> >> [ 211.937975] [<0000001ca52f06d6>] mmap_region+0x19e/0x848
> >> [ 211.937978] ([<0000001ca52f06b6>] mmap_region+0x17e/0x848)
> >> [ 211.937981] [<0000001ca52f116a>] do_mmap+0x3ea/0x4c8
> >> [ 211.937983] [<0000001ca52bed12>] vm_mmap_pgoff+0xda/0x178
> >> [ 211.937987] [<0000001ca52ed5ea>] ksys_mmap_pgoff+0x62/0x238
> >> [ 211.937989] [<0000001ca52ed992>] __s390x_sys_old_mmap+0x7a/0xa0
> >> [ 211.937993] [<0000001ca5c4ef5c>] __do_syscall+0x1d4/0x200
> >> [ 211.937999] [<0000001ca5c5d572>] system_call+0x82/0xb0
> >> [ 211.938002] Last Breaking-Event-Address:
> >> [ 211.938003] [<0000001ca5888616>] mas_prev+0xb6/0xc0
> >> [ 211.938010] Oops: 0038 ilc:3 [#2]
> >> [ 211.938011] Kernel panic - not syncing: Fatal exception: panic_on_oops
> >> [ 211.938012] SMP
> >> [ 211.938014] Modules linked in:
> >> 07: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 0000001C
> >> A50679A6
> >>
> >> Is that issue supposed to be fixed? git bisect pointed me to
> >>
> >> # bad: [76535d42eb53485775a8c54ea85725812b75543f] Merge branch
> >> 'mm-everything' of
> >> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >>
> >> which isn't really helpful.
> >>
> >> Anything we could help with debugging this?
> >
> > I tested the maple tree on top of the s390 as it was the same crash and
> > it was okay. I haven't tested the mm-everything branch though. Can you
> > test mm-unstable?
>
> Yes, i tested mm-unstable but wasn't able to reproduce the issue.
>
> > I'll continue setting up a sparc VM for testing here and test
> > mm-everything on that and the s390
>
> One thing that is different compared to x86 is that both sparc and s390
> are big endian. Not sure whether and where that would make a difference.
>
> The code to trigger the crash on s390 is rather simple: Just force a
> paging level upgrade to 5 levels by calling mmap() with an address that
> doesn't fit in 3 levels. Haven't tested whether an upgrade to 4 levels
> would be sufficient. I've condensed our test case that triggers this, and
> basically all that is required is:
>
> --------------------------------8<---------------------------------------
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/wait.h>
> #include <stdio.h>
>
> #define PAGE_SIZE 0x1000
> #define _REGION1_SIZE (1UL << 54)
>
> int main(int argc, char *argv[])
> {
> 	int pid, status;
> 	void *addr;
>
> 	pid = fork();
> 	if (pid == 0) {
> 		/*
> 		 * Trigger page table level upgrade
> 		 */
> 		addr = mmap((void *)_REGION1_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE,
> 			    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> 		if (addr == MAP_FAILED)
> 			return 1;
> 		*(int *)addr = 1;
> 		return 0;
> 	}
> 	wait(&status);
> 	return 0;
> }
> --------------------------------8<---------------------------------------
>

I tried the above on my qemu s390 with kernel 5.18.0-rc6-next-20220513,
but it runs without issue, return code is 0. Is there something the VM
needs to have for this to trigger?

> I've added a few debug statements to the maple tree code:
>
> [ 27.769641] mas_next_entry: offset=14
> [ 27.769642] mas_next_nentry: entry = 0e00000000000000, slots=0000000090249f80, mas->offset=15 count=14

Where exactly are you printing this?

>
> I see in mas_next_nentry() that there's a while that iterates over the
> (used?) slots until count is reached.

Yes, mas_next_nentry() looks for the next non-null entry in the current
node.

> After that loop mas_next_entry()
> just picks the next (unused?) entry, which is slot 15 in that case.

mas_next_entry() returns the next non-null entry. If there isn't one
returned by mas_next_nentry(), then it will advance to the next node by
calling mas_next_node(). There are checks in there for detecting dead
nodes for RCU use and limit checking as well.
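
As a rough illustration of the lookup order described above, here is a toy
model. It is not the code in lib/maple_tree.c: the toy_* names, the flat
linked list of nodes and the `end` field are assumptions made for this
sketch, and RCU dead-node detection and limit checks are left out. The point
is only that the in-node scan is bounded by the node's end, and that the walk
moves to the next node when the current one is exhausted.

```c
#include <stddef.h>

#define TOY_SLOTS 16	/* assumed slot count, chosen to mirror a 64-bit range node */

/* Hypothetical, simplified node: real maple nodes are laid out differently. */
struct toy_node {
	void *slot[TOY_SLOTS];
	unsigned char end;	/* index of the last slot in use */
	struct toy_node *next;	/* stand-in for walking to the next node */
};

struct toy_state {
	struct toy_node *node;
	int offset;		/* last slot visited, -1 before the first */
};

/* Next non-NULL entry in the current node, never looking past node->end. */
static void *toy_next_nentry(struct toy_state *st)
{
	while (st->offset < (int)st->node->end) {
		void *entry = st->node->slot[++st->offset];

		if (entry)
			return entry;
	}
	return NULL;
}

/* Next non-NULL entry overall: try the current node, then advance. */
static void *toy_next_entry(struct toy_state *st)
{
	while (st->node) {
		void *entry = toy_next_nentry(st);

		if (entry)
			return entry;
		st->node = st->node->next;
		st->offset = -1;
	}
	return NULL;
}
```

In this model a stale or non-entry value sitting in a slot beyond `end` is
never returned, because the scan stops at `end` before moving on.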

>
> What i noticed while scanning over include/linux/maple_tree.h is:
>
> struct maple_range_64 {
> 	struct maple_pnode *parent;
> 	unsigned long pivot[MAPLE_RANGE64_SLOTS - 1];
> 	union {
> 		void __rcu *slot[MAPLE_RANGE64_SLOTS];
> 		struct {
> 			void __rcu *pad[MAPLE_RANGE64_SLOTS - 1];
> 			struct maple_metadata meta;
> 		};
> 	};
> };
>
> and struct maple_metadata is:
>
> struct maple_metadata {
> 	unsigned char end;
> 	unsigned char gap;
> };
>
> If i swap the gap and end members 0x0e00000000000000 becomes
> 0x000e000000000000. And 0xe matches our mas->offset 14 above.
> So it looks like mas_next() in mmap_region returns the meta
> data for the node.

If this is the case, then I think any task that has more than 14 VMAs
would have issues. I also use mas_next_entry() in mas_find() which is
used for the mas_for_each() macro/iterator. Can you please enable
CONFIG_DEBUG_VM_MAPLE_TREE ? mmap.c tests the tree after pretty much
any change and will dump useful information if there is an issue -
including the entire tree. See validate_mm_mt() for details.

You can find CONFIG_DEBUG_VM_MAPLE_TREE in the config:
kernel hacking -> Memory debugging -> Debug VM -> Debug VM maple trees
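
For reference, the byte-order arithmetic behind the quoted observation can be
checked in userspace. The sketch below is only an illustration: the union is
a stand-in that uses 64-bit words instead of __rcu pointers, SLOTS = 16 is
assumed to match MAPLE_RANGE64_SLOTS on 64-bit, and the helper names are made
up. It reads the bytes of the last slot as a big-endian word, so it shows
what a big-endian machine would load from slot[15] regardless of the host it
runs on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOTS 16	/* assumed to match MAPLE_RANGE64_SLOTS on 64-bit */

struct meta {
	unsigned char end;
	unsigned char gap;
};

/* Illustrative stand-in for the slot/metadata union in maple_range_64. */
union slots_or_meta {
	uint64_t slot[SLOTS];
	struct {
		uint64_t pad[SLOTS - 1];
		struct meta meta;
	};
};

/* The metadata starts at the same byte as the last slot. */
static_assert(offsetof(union slots_or_meta, meta) ==
	      (SLOTS - 1) * sizeof(uint64_t), "meta must overlay last slot");

/* Read the last slot's bytes as a big-endian 64-bit word. */
static uint64_t slot15_as_big_endian_word(const union slots_or_meta *u)
{
	unsigned char bytes[sizeof(uint64_t)];
	uint64_t v = 0;

	memcpy(bytes, &u->slot[SLOTS - 1], sizeof(bytes));
	for (size_t i = 0; i < sizeof(bytes); i++)
		v = (v << 8) | bytes[i];
	return v;
}

/* Build a zeroed node image with the given metadata and read slot 15. */
static uint64_t meta_seen_as_slot(unsigned char end, unsigned char gap)
{
	union slots_or_meta u;

	memset(&u, 0, sizeof(u));
	u.meta.end = end;
	u.meta.gap = gap;
	return slot15_as_big_endian_word(&u);
}
```

With end = 14 and gap = 0, meta_seen_as_slot() gives 0x0e00000000000000, the
failing address from the oops; with the two values swapped it gives
0x000e000000000000, as noted in the quoted text.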

>
> So from the lines above you likely already guessed that i have no clue
> how maple tree works, and i didn't have enough time today to read all
> the magic and understand it. But i thought i just drop my observation
> here in case someone has an idea.

Thanks for sharing. I'm having a hard time recreating the issue so I
cannot fully dig in myself.



I was able to boot sparc64 with mm-unstable. I did get an error:
[ 5.002625] Kernel unaligned access at TPC[59bae8] mmap_region+0x168/0xb00

faddr2line is less than useful, though: it reports the line as "at ??:?".

I'll keep digging into that.

Thanks,
Liam