Re: [BUG] new copy_hugetlb_page_range() causing crashes
From: Guillaume Morin
Date: Thu Jul 17 2014 - 17:59:50 EST
On 17 Jul 17:33, Naoya Horiguchi wrote:
> In my environment (kernel-3.14.12, libhugetlbfs-utils-2.16-2.fc20.x86_64),
> the crash looks like this:
>
> [root@test_140717-1333 hugetlbfs_test]# $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap
> bash: $: command not found...
> p 0x2200010
> pid 2809
> *** Error in `./heap': break adjusted to free malloc space: 0x0000000002501000 ***
> ======= Backtrace: =========
> /lib64/libc.so.6[0x3940e75cff]
> /lib64/libc.so.6[0x3940e7f121]
> /lib64/libc.so.6(__libc_malloc+0x5c)[0x3940e7ff6c]
> ./heap[0x400767]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x3940e21d65]
> ./heap[0x400619]
> ======= Memory map: ========
> 00400000-00401000 r-xp 00000000 fd:01 272411 /root/hugetlbfs_test/heap
> 00600000-00601000 r--p 00000000 fd:01 272411 /root/hugetlbfs_test/heap
> 00601000-00602000 rw-p 00001000 fd:01 272411 /root/hugetlbfs_test/heap
> 02200000-02600000 rw-p 00000000 00:0c 23209 /anon_hugepage (deleted)
> 02600000-02800000 rw-p 00400000 00:0c 25663 /anon_hugepage (deleted)
> 3940a00000-3940a20000 r-xp 00000000 fd:01 175094 /usr/lib64/ld-2.18.so
> 3940c1f000-3940c20000 r--p 0001f000 fd:01 175094 /usr/lib64/ld-2.18.so
> 3940c20000-3940c21000 rw-p 00020000 fd:01 175094 /usr/lib64/ld-2.18.so
> 3940c21000-3940c22000 rw-p 00000000 00:00 0
> 3940e00000-3940fb4000 r-xp 00000000 fd:01 175095 /usr/lib64/libc-2.18.so
> 3940fb4000-39411b4000 ---p 001b4000 fd:01 175095 /usr/lib64/libc-2.18.so
> 39411b4000-39411b8000 r--p 001b4000 fd:01 175095 /usr/lib64/libc-2.18.so
> 39411b8000-39411ba000 rw-p 001b8000 fd:01 175095 /usr/lib64/libc-2.18.so
> 39411ba000-39411bf000 rw-p 00000000 00:00 0
> 3941200000-3941203000 r-xp 00000000 fd:01 175098 /usr/lib64/libdl-2.18.so
> 3941203000-3941402000 ---p 00003000 fd:01 175098 /usr/lib64/libdl-2.18.so
> 3941402000-3941403000 r--p 00002000 fd:01 175098 /usr/lib64/libdl-2.18.so
> 3941403000-3941404000 rw-p 00003000 fd:01 175098 /usr/lib64/libdl-2.18.so
> 7f3860277000-7f386028c000 r-xp 00000000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1
> 7f386028c000-7f386048b000 ---p 00015000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1
> 7f386048b000-7f386048c000 r--p 00014000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1
> 7f386048c000-7f386048d000 rw-p 00015000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1
> 7f38604a2000-7f38604a6000 rw-p 00000000 00:00 0
> 7f38604a6000-7f38604b6000 r-xp 00000000 fd:01 177014 /usr/lib64/libhugetlbfs.so
> 7f38604b6000-7f38606b5000 ---p 00010000 fd:01 177014 /usr/lib64/libhugetlbfs.so
> 7f38606b5000-7f38606b6000 r--p 0000f000 fd:01 177014 /usr/lib64/libhugetlbfs.so
> 7f38606b6000-7f38606b7000 rw-p 00010000 fd:01 177014 /usr/lib64/libhugetlbfs.so
> 7f38606b7000-7f38606c2000 rw-p 00000000 00:00 0
> 7f38606d5000-7f38606d6000 rw-p 00000000 00:00 0
> 7f38606d6000-7f38606d8000 rw-p 00000000 00:00 0
> 7fff07c44000-7fff07c65000 rw-p 00000000 00:00 0 [stack]
> 7fff07d52000-7fff07d54000 r-xp 00000000 00:00 0 [vdso]
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
>
> Is this crash the same as yours?
It's very similar in the sense that's an assert in malloc. My error is
slightly different but I guess the exact error depends on a few runtime
parameters
> And it seems that this also happens on v3.16-rc5.
> So it might be an upstream bug, not a stable-specific matter.
That's my understanding as well. I just reported it for 3.4 and 3.14
since these were the kernels I could easily try my original test with.
> It looks strange to me that the problem is gone by removing the commit
> 4a705fef98 (although I confirmed it is,) because the kernel's behavior
> shouldn't change unless (is_hugetlb_entry_migration(entry) ||
> is_hugetlb_entry_hwpoisoned(entry)) is true. And I checked with systemtap
> that both these check returned false in the above test program.
> So I'm wondering why the commit makes difference for this test program.
I don't know why I am just thinking about that now. Isn't this the fact
that your patch moved the huge_ptep_get() before
huge_ptep_set_wrprotect() in the pte_present() cow case?
Actually, I've just tried to re-add the huge_ptep_get call for that
case and it's fixing the problem for me...
Hmm, want a patch?
--
Guillaume Morin <guillaume@xxxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/