Re: Oops in VMA code

From: Alexander Graf
Date: Thu Jun 16 2011 - 02:12:33 EST



On 16.06.2011, at 08:02, Benjamin Herrenschmidt wrote:

> On Thu, 2011-06-16 at 07:32 +0200, Alexander Graf wrote:
>> On 16.06.2011, at 06:32, Linus Torvalds wrote:
>
>> Thanks a lot for looking at it either way :).
>
> Yeah thanks ;-) Let me see what I can dig out.
>
> First it's a load from what looks like a valid pointer to the linear
> mapping that had one byte corrupted (or more but it looks reasonably
> "clean"). It's not a one bit error, there's at least 2 bad bits (the
> 09):
>
> DAR: c00090026236bbc0
>
> Alex, how much RAM do you have ? If that was just a one byte corruption,
> the above would imply you have something valid between 9 and 10G. From
> the look of other registers, it seems that it could be a genuine pointer
> with just that stray "09" byte that landed onto it.

Heh, you beat me to it. I was just writing up a reply to Linus explaining that I only have 8GB of RAM and that this address has more invalid bits than just the "09". Either it is complete garbage from the 3rd byte onwards, or at least the 0x9002 part is wrong.
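
To spell out the arithmetic behind that: a minimal sketch in C, assuming the ppc64 linear mapping is simply PAGE_OFFSET (0xc000000000000000) plus the physical address, and that the 8GB of RAM sits contiguously from physical address 0. The names PAGE_OFFSET_SKETCH, RAM_BYTES_SKETCH, linear_offset() and backed_by_ram() are made up for this illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: start of the ppc64 kernel linear mapping. */
#define PAGE_OFFSET_SKETCH 0xc000000000000000ULL
/* Assumption: 8GB of RAM, contiguous from physical address 0. */
#define RAM_BYTES_SKETCH   (8ULL << 30)

/* Offset of a linear-mapping address from the start of the mapping. */
static uint64_t linear_offset(uint64_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET_SKETCH);
    return vaddr - PAGE_OFFSET_SKETCH;
}

/* Non-zero if the address would land inside the assumed 8GB of RAM. */
static int backed_by_ram(uint64_t vaddr)
{
    return linear_offset(vaddr) < RAM_BYTES_SKETCH;
}
```

Under those assumptions, the DAR as reported (0xc00090026236bbc0) has an offset of 0x90026236bbc0, far outside RAM. Clearing only the "9" gives 0xc00000026236bbc0, an offset of 0x26236bbc0, which is between 9 and 10G and still beyond 8GB. Only with the whole 0x9002 cleared (0xc00000006236bbc0, offset 0x6236bbc0) does backed_by_ram() return non-zero.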

>
>> The latter is the one I'm executing, while the former still has all
>> the symbols. But you're right. It looks like this is simply an inlined
>> function - which is why it got stripped away. Here's the disassembly
>> of the whole do_unmap function. I hope it's of use despite your fading
>> PPC asm skills :). Host compiler is gcc 4.3.4 from SLES11SP1.
>
> .../...
>
> Ok, so let's see what we can dig from here. It -looks- like:
>
> if (!mm) goto out :
>
>> 0xc000000000190554 <find_vma_prev>: cmpdi cr7,r3,0
>> 0xc000000000190558 <find_vma_prev+4>: beq cr7,0xc0000000001907f0 <remove_vma_list+836>
>
> rb_node = mm->mm_rb.rb_node; (rb_node in r9):
>
>> 0xc00000000019055c <find_vma_prev+8>: ld r9,8(r3)
>
> vma = mm->mmap (vma in r28)
>
>> 0xc000000000190560 <find_vma_prev+12>: ld r28,0(r3)
>> 0xc000000000190564 <find_vma_prev+16>: li r11,0
>> 0xc000000000190568 <find_vma_prev+20>: li r26,0
>
> while(rb_node)...
>
>> 0xc00000000019056c <find_vma_prev+24>: cmpdi cr7,r9,0
>> 0xc000000000190570 <find_vma_prev+28>: bne cr7,0xc000000000190594 <find_vma_prev+64>
>> 0xc000000000190574 <find_vma_prev+32>: b 0xc0000000001905d0 <do_munmap+368>
>> 0xc000000000190578 <find_vma_prev+36>: nop
>> 0xc00000000019057c <find_vma_prev+40>: nop
>> 0xc000000000190580 <find_vma_prev+44>: ld r9,16(r9)
>> 0xc000000000190584 <find_vma_prev+48>: mr r26,r11
>> 0xc000000000190588 <find_vma_prev+52>: cmpdi cr7,r9,0
>> 0xc00000000019058c <find_vma_prev+56>: mr r11,r26
>> 0xc000000000190590 <find_vma_prev+60>: beq cr7,0xc0000000001905c4 <find_vma_prev+112>
>
> vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
>
>> 0xc000000000190594 <find_vma_prev+64>: addi r26,r9,-56
>
> if (vma_tmp->vm_end)
>
>> 0xc000000000190598 <find_vma_prev+68>: ld r0,16(r26)
>
> Here we go. So here vma_tmp is crap, which we got out of the rb_tree,
> so it's either corruption or use after free I'd say. It could also be a
> completely unrelated memory corruption of course....
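
For reference, the walk being decoded above can be sketched in plain C roughly as follows. This is a simplified userspace model, not the kernel source: the *_sketch names are invented, the struct layouts are assumptions chosen so that on a 64-bit (LP64) build the offsets line up with the loads in the disassembly (mmap at 0 and the rb root at 8 in the mm, vm_end at 16 and vm_rb at 56 in the vma, rb_left at 16 in the node), and the loop body is modeled on find_vma_prev() as I remember it from kernels of that era.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed node layout: parent/color word, then right, then left child. */
struct rb_node_sketch {
    uintptr_t parent_color;
    struct rb_node_sketch *rb_right;
    struct rb_node_sketch *rb_left;      /* offset 16: "ld r9,16(r9)" */
};

/* Assumed leading fields of the vma; everything after vm_rb is omitted. */
struct vma_sketch {
    void *vm_mm;
    uint64_t vm_start;
    uint64_t vm_end;                     /* offset 16: "ld r0,16(r26)" */
    struct vma_sketch *vm_next, *vm_prev;
    uint64_t vm_page_prot;
    uint64_t vm_flags;
    struct rb_node_sketch vm_rb;         /* offset 56: "addi r26,r9,-56" */
};

struct mm_sketch {
    struct vma_sketch *mmap;             /* offset 0: "ld r28,0(r3)" */
    struct rb_node_sketch *rb_root;      /* offset 8: "ld r9,8(r3)" */
};

/* These only hold on LP64; they document the layout assumption. */
static_assert(offsetof(struct vma_sketch, vm_end) == 16, "vm_end offset");
static_assert(offsetof(struct vma_sketch, vm_rb) == 56, "vm_rb offset");

/* Same idea as rb_entry()/container_of(): step back to the enclosing struct. */
#define rb_entry_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct vma_sketch *find_vma_prev_sketch(struct mm_sketch *mm,
                                               uint64_t addr,
                                               struct vma_sketch **pprev)
{
    struct vma_sketch *vma = NULL, *prev = NULL;
    struct rb_node_sketch *rb_node;

    if (!mm)
        goto out;

    vma = mm->mmap;          /* first vma, used if nothing precedes addr */
    rb_node = mm->rb_root;

    while (rb_node) {
        struct vma_sketch *vma_tmp =
            rb_entry_sketch(rb_node, struct vma_sketch, vm_rb);

        /* This is the dereference that faulted: vma_tmp came out of the tree. */
        if (addr < vma_tmp->vm_end) {
            rb_node = rb_node->rb_left;
        } else {
            prev = vma_tmp;
            if (!prev->vm_next || addr < prev->vm_next->vm_end)
                break;
            rb_node = rb_node->rb_right;
        }
    }
out:
    *pprev = prev;
    return prev ? prev->vm_next : vma;
}
```

The point of the sketch is that vma_tmp is never validated: whatever pointer sits in the tree has 56 subtracted from it and is then read at offset 16. A single bad rb_node pointer, whether from corruption or from a freed vma left in the tree, is enough to produce a fault address like the DAR above.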

I'm usually pretty sceptical about blaming hardware for memory corruption issues, so this would mean some random code overwrote things here. That sounds pretty hard to track down to me.

> If you had xmon we could have dug a little bit more to see what's
> before/after etc... but like this it doesn't ring any special bell to
> me.

Yeah, I've since rebooted the machine :). Let's just leave it here and see if maybe someone else stumbles over the same thing, so we can potentially gather some data points. I'd say it's unlikely that this is really related to the memory management code.


Alex
