Re: [linus:master] [hugetlb] 7118fc2906: kernel_BUG_at_lib/list_debug.c

From: Feng Tang
Date: Wed Jan 18 2023 - 10:13:48 EST


On Wed, Jan 18, 2023 at 02:35:17PM +0100, Vlastimil Babka wrote:
> On 1/17/23 19:25, Linus Torvalds wrote:
> > On Tue, Jan 17, 2023 at 4:22 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> >>
> >> With the following patch to use 'O1' instead 'O2' gcc optoin for
> >> page_alloc.c, the list corruption issue can't be reproduced for
> >> commit 7118fc2906 in 1000 runs.
> >
> > Ugh.
> >
> > It would be lovely if you could just narrow it down with
> >
> > #pragma GCC optimize ("O1")
> > ...
> > #pragma GCC optimize ("O2")
> >
> > around just that prep_compound_page(), but when I tried it myself I
> > get some function attribute mismatch errors.
> >
> >
> >> As is can't be reproduced with X86_64 build, it could be i386
> >> compiling related.
> >
> > Your particular config causes a huge amount of nasty 64-bit arithmetic
> > according to the objdump code, with sequences like
> >
> > c13b3cbb: 83 05 d0 28 6c c5 01 addl $0x1,0xc56c28d0
> > c13b3cc2: 83 15 d4 28 6c c5 00 adcl $0x0,0xc56c28d4
> >
> > which seems to be just from some coverage profiling being on
> > (CONFIG_GCOV?), or something. It makes it very hard to read the code.
>
> I think whatever this is, it's also responsible for the bug.
> Now from all the dumps I've seen so far, the struct page corruptions looked
> like prep_compound_page() is called more times than it should. Which didn't
> make sense as the code is very simple, a loop of 1 << order - 1 iterations.
>
> If I look at Feng's disassembly of 48b8d744ea84 (the "good" parent commit),
> I can work out that edi holds the 1 << order, ebx is initialized as 1, and
> there's a nice clear "inc ebx, cmp ebx, edi, jne <loop>".
>
> Now the disassembly of the "bad" commit is much harder to follow when it
> comes to the iteration control. There are xors and or's involved so I didn't
> even try to work out the meaning. Clearly it can't be deterministically
> wrong or it would crash always and fast. But crucially I think it uses
> memory addresses 0xc56c28f8 and 0xc56c28fc in a racy way and that could
> explain the bug.
>
> Before the loop iteration itself, it initializes several registers from
> these addreses, and the values end up modifying also registers used for the
> loop iteration control (ebx/esi) And crucially in the loop iteration it also
> writes some values to the same addresses, and there seems to be no
> synchronization whatsoever. So it seems to me, even if these addresses are
> supposed to be "private" to prep_compound_page(), running two
> prep_compound_page() in parallel, or even due to interrupt (Hyeonggon did
> reproduce this with -smp 1 as well) can result in a race ultimately
> affecting the number of iterations taken.

Thanks for the decoding!

For the strange memory address 0xc56c28xx in the assembly, they are
GCOV related memory for prep_compound_page(), as from the System.map:

c56c28a0 b __gcov0.free_compound_page
c56c28c0 b __gcov0.prep_compound_page
c56c2918 b __gcov0.early_debug_pagealloc

And your analysis of racing of these region makes sense, that some
pointer could be changed when a race happens. From the memory dump
in previous thread, it was always one off issue, that page[2] for
order 1 and page[8] for order 3 were wrongly written.

Thanks,
Feng

> Attaching the disasm I annotated for more details. Everything register whose
> value is coming from 0xc56c28f8/0xc56c28fc in some way is annotated as
> "crap" there (sorry).
>
> > You also have UBSAN enabled, which - again - makes for some really
> > grotty asm that hides any actual logic.
> >
> > Finally, your objdump version also does some horrendous decoding, like
> >
> > c13b3e29: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
> >
> > which is just a 7-byte 'nop' instruction, but again, it makes it
> > rather hard to actually read the code.
> >
> > With the i386 defconfig, gcc generates a function that is just ~30
> > instructions for me, so this makes a huge difference in the legibility
> > of the code.
> >
> > I wonder if you can recreate the issue with a much more
> > straightforward config. By all means, leave DEBUG_PAGEALLOC and SLUB
> > debugging on, but without the things like UBSAN and GCOV.
> >
> >> I also objdumped 'prep_compound_page' for vmlinux of 7118fc2906 and
> >> its parent commit 48b8d744ea84, which have big difference than the
> >> simple 'set_page_count()' change, but I can't tell which part is
> >> abnormal, so attach them for further check.
> >
> > Yeah, I can't make heads or tails of them either, see above on how
> > illegible the objdump files are. And that's despite not even having
> > all of prep_compound_page() in them (it's missing
> > prep_compound_page.cold, which is probably just UBSAN fixup code, but
> > who knows..)
> >
> > That said, with the i386 defconfig, the only change from adding
> > set_page_count() to the loop seems to be exactly that:
> >
> > .L589:
> > - movl $1024, 12(%eax)
> > + movl $0, 28(%eax)
> > addl $32, %eax
> > + movl $1024, -20(%eax)
> > movl %esi, -28(%eax)
> > movl $0, -12(%eax)
> > cmpl %edx, %eax
> >
> > (don't ask me why gcc does *one* access using the pre-incremented
> > pointer, and then the rest to the post-incremented ones, but whatever
> > - it means that it's not just "add a mov $0", it's also changing how
> > the
> >
> > p->mapping = TAIL_MAPPING;
> >
> > instruction is done, which is that
> >
> > - movl $1024, 12(%eax)
> > + movl $1024, -20(%eax)
> >
> > part of the change)
> >
> > Linus

> inputs:
> eax: page
> edx: order
>
> stack below ebp
> -0x04 saved edi
> -0x08 saved esi
> -0x0c saved ebp
> (sub $0x14,%esp)
> -0x10 crap
> -0x14 crap
> -0x18 page|1 (compound_head value)
> -0x1c copy of order
> -0x20 page
> -0x24
> esp
>
> eax = page|1 (compound_head value) / crap
> edx = page / crap
> edi = page / crap
> ebx = 1 << order (=2) / 1 << order -2 (=0) / crap!
> ecx = page + 1 (first tail page)
> esi = 0 / crap
>
> c13b3b90 <prep_compound_page>:
> c13b3b90: 55 push %ebp
> c13b3b91: 89 e5 mov %esp,%ebp
> c13b3b93: 57 push %edi
> c13b3b94: 89 c7 mov %eax,%edi edi = page
> c13b3b96: 56 push %esi
> c13b3b97: 53 push %ebx
> c13b3b98: 83 ec 14 sub $0x14,%esp
> c13b3b9b: 83 fa 1f cmp $0x1f,%edx
> c13b3b9e: 89 55 e4 mov %edx,-0x1c(%ebp)
> c13b3ba1: 0f 87 33 31 ed 01 ja c3286cda <prep_compound_page.cold> only when order > 31 (?)
> c13b3ba7: 0f b6 4d e4 movzbl -0x1c(%ebp),%ecx ecx = order
> c13b3bab: bb 01 00 00 00 mov $0x1,%ebx ebx = 1
> c13b3bb0: d3 e3 shl %cl,%ebx ebx = 1 << order (=2)
> c13b3bb2: 83 3f ff cmpl $0xffffffff,(%edi)
> c13b3bb5: 0f 84 65 02 00 00 je c13b3e20 <prep_compound_page+0x290> only when page flags == 0xffffffff (?)
> c13b3bbb: 83 05 d0 28 6c c5 01 addl $0x1,0xc56c28d0
> c13b3bc2: 83 15 d4 28 6c c5 00 adcl $0x0,0xc56c28d4
> c13b3bc9: 0f ba 2f 10 btsl $0x10,(%edi) sets bit 0x10 (PG_head) in page flags
> c13b3bcd: 83 05 f0 28 6c c5 01 addl $0x1,0xc56c28f0
> c13b3bd4: 83 15 f4 28 6c c5 00 adcl $0x0,0xc56c28f4
> c13b3bdb: 83 fb 01 cmp $0x1,%ebx
> c13b3bde: 0f 8e 80 00 00 00 jle c13b3c64 <prep_compound_page+0xd4> nr_pages <= 1 -> skip over the whole loop
> c13b3be4: 8d 47 01 lea 0x1(%edi),%eax eax = page|1 (compound_head value)
> c13b3be7: 8b 15 fc 28 6c c5 mov 0xc56c28fc,%edx edx now some crap, we don't know how this was initialized !!!
> c13b3bed: 89 45 e8 mov %eax,-0x18(%ebp) saves the page|1
> c13b3bf0: a1 f8 28 6c c5 mov 0xc56c28f8,%eax eax now some crap, again we don't know how this was initialized !!!
> c13b3bf5: 8d 4f 28 lea 0x28(%edi),%ecx ecx = edi + 0x28 - first tail page
> c13b3bf8: 89 7d e0 mov %edi,-0x20(%ebp) saves page pointer
> c13b3bfb: 83 c0 01 add $0x1,%eax crap + 1
> c13b3bfe: 89 45 ec mov %eax,-0x14(%ebp) saves the crap
> c13b3c01: 83 d2 00 adc $0x0,%edx crap in crap out
> c13b3c04: a1 f8 28 6c c5 mov 0xc56c28f8,%eax reset as crap
> c13b3c09: 89 55 f0 mov %edx,-0x10(%ebp) save more crap
> c13b3c0c: 8b 15 fc 28 6c c5 mov 0xc56c28fc,%edx reset as crap
> c13b3c12: 83 eb 02 sub $0x2,%ebx ebx is nr_pages - 2 (for order-1 was 2, now 0) - remaining tail pages after first iteration?
> c13b3c15: 31 f6 xor %esi,%esi esi is 0
> c13b3c17: 83 c0 02 add $0x2,%eax crap + 2
> c13b3c1a: 83 d2 00 adc $0x0,%edx crap
> c13b3c1d: 01 c3 add %eax,%ebx ebx is now modified by crap !!! !!!
> c13b3c1f: 8b 45 ec mov -0x14(%ebp),%eax reset as crap
> c13b3c22: 11 d6 adc %edx,%esi esi now also crap
> c13b3c24: 8b 55 f0 mov -0x10(%ebp),%edx reset as crap
> c13b3c27: 89 f7 mov %esi,%edi edi now also crap, oh my
> c13b3c29: 89 de mov %ebx,%esi supposed to be the remaining tail pages, byt modified by crap
> c13b3c2b: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi elaborate nop
> c13b3c2f: 90 nop nop
>
> // this should be the loop for prep_compound_tail
> c13b3c30: a3 f8 28 6c c5 mov %eax,0xc56c28f8 uh now we saved some value to the memory we read previously assuming some known good value? !!!
> c13b3c35: 8b 5d e8 mov -0x18(%ebp),%ebx ebx = page|1 (compound_head value)
> c13b3c38: 83 c0 01 add $0x1,%eax still crap
> c13b3c3b: 89 15 fc 28 6c c5 mov %edx,0xc56c28fc and now we wrote to the other address as well? !!!
> c13b3c41: 83 d2 00 adc $0x0,%edx still crap
> c13b3c44: 83 c1 28 add $0x28,%ecx ecx pointed to first tail pages, now points to second tail page
> c13b3c47: c7 41 e4 00 04 00 00 movl $0x400,-0x1c(%ecx) (first tail page)->mapping = TAIL_MAPPING;
> c13b3c4e: 89 59 dc mov %ebx,-0x24(%ecx) (first tail page)->compound_head is now set
> c13b3c51: 89 fb mov %edi,%ebx ebx now crap
> c13b3c53: 31 d3 xor %edx,%ebx ebx xored with more crap
> c13b3c55: 89 5d ec mov %ebx,-0x14(%ebp) we might need that crap later
> c13b3c58: 89 f3 mov %esi,%ebx supposed to be the remaining tail pages, byt modified by crap
> c13b3c5a: 31 c3 xor %eax,%ebx yeah why not xor it with more crap
> c13b3c5c: 0b 5d ec or -0x14(%ebp),%ebx and "or" it with the previously saved crap
> c13b3c5f: 75 cf jne c13b3c30 <prep_compound_page+0xa0> and that's how we decided if we should do another iteration :(
>
> c13b3c61: 8b 7d e0 mov -0x20(%ebp),%edi restore edi = page
>
> // here we land if we skip everything at c13b3bde
> c13b3c64: c6 47 30 01 movb $0x1,0x30(%edi)
> // the rest shouldn't be important