Re: Kernel 4.17.4 lockup

From: H.J. Lu
Date: Mon Jul 09 2018 - 22:14:26 EST


On Mon, Jul 9, 2018 at 5:44 PM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> ... cc'ing a few folks who I know have been looking at this code
> lately. The full oops is below if any of you want to take a look.
>
> OK, well, annotating the disassembly a bit:
>
>> (gdb) disass free_pages_and_swap_cache
>> Dump of assembler code for function free_pages_and_swap_cache:
>> 0xffffffff8124c0d0 <+0>: callq 0xffffffff81a017a0 <__fentry__>
>> 0xffffffff8124c0d5 <+5>: push %r14
>> 0xffffffff8124c0d7 <+7>: push %r13
>> 0xffffffff8124c0d9 <+9>: push %r12
>> 0xffffffff8124c0db <+11>: mov %rdi,%r12 // %r12 = pages
>> 0xffffffff8124c0de <+14>: push %rbp
>> 0xffffffff8124c0df <+15>: mov %esi,%ebp // %ebp = nr
>> 0xffffffff8124c0e1 <+17>: push %rbx
>> 0xffffffff8124c0e2 <+18>: callq 0xffffffff81205a10 <lru_add_drain>
>> 0xffffffff8124c0e7 <+23>: test %ebp,%ebp // test nr==0
>> 0xffffffff8124c0e9 <+25>: jle 0xffffffff8124c156 <free_pages_and_swap_cache+134>
>> 0xffffffff8124c0eb <+27>: lea -0x1(%rbp),%eax
>> 0xffffffff8124c0ee <+30>: mov %r12,%rbx // %rbx = pages
>> 0xffffffff8124c0f1 <+33>: lea 0x8(%r12,%rax,8),%r14 // load &pages[nr] into %r14?
>> 0xffffffff8124c0f6 <+38>: mov (%rbx),%r13 // %r13 = pages[i]
>> 0xffffffff8124c0f9 <+41>: mov 0x20(%r13),%rdx //<<<<<<<<<<<<<<<<<<<< GPF here.
> %r13 is 64-byte aligned, so looks like a halfway reasonable 'struct page *'.
>
> %R14 looks OK (0xffff93d4abb5f000) because it points to the end of a
> dynamically-allocated (not on-stack) mmu_gather_batch page. %RBX points
> 50 pointer-sized slots past the start of that same batch page, which
> makes it the 48th entry of pages[] once you account for the pointer and
> two integers at the beginning of the structure. That 48 is important
> because it's way larger than the on-stack size of 8.
>
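
Checking that arithmetic (a quick userspace sketch, assuming 8-byte
pointers and the batch layout quoted further down; the %rbx value is
back-computed from your description, not copied out of the oops):

#include <stdio.h>

int main(void)
{
	unsigned long r14 = 0xffff93d4abb5f000UL; /* &pages[nr], right at the page end */
	unsigned long batch = r14 - 4096;         /* start of that mmu_gather_batch page */
	unsigned long rbx = batch + 50 * 8;       /* 50 pointer slots in, per your math */
	unsigned long pages_off = 16;             /* next pointer + nr + max = 8 + 4 + 4 */

	/* (rbx - batch - pages_off) / 8 == 48 */
	printf("index = %lu\n", (rbx - batch - pages_off) / 8);
	return 0;
}

So the faulting entry is pages[48], as you said.
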
> It's hard to make much sense of %R13 (pages[48] / 0xfffbf0809e304bc0)
> because the vmemmap addresses get randomized. But, I _think_ that's too
> high of an address for a 4-level paging vmemmap[] entry. Does anybody
> else know offhand?
>
> I'd really want to see this reproduced without KASLR to make the oops
> easier to read. It would also be handy to try your workload with all
> the pedantic debugging: KASAN, slab debugging, DEBUG_PAGEALLOC, etc...
> and see if it still triggers.

How can I turn them on at boot time?
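
To be concrete: is it just a matter of building with something like the
options below and then adding the boot parameters, or is there more to it?
(This is my guess from Documentation/admin-guide/kernel-parameters.txt,
not something I have verified; KASAN in particular looks compile-time only.)

# .config (compile time)
CONFIG_KASAN=y
CONFIG_SLUB_DEBUG=y
CONFIG_DEBUG_PAGEALLOC=y

# kernel command line (boot time)
slub_debug=FZPU debug_pagealloc=on nokaslr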

> Some relevant functions and structures below for reference.
>
> void free_pages_and_swap_cache(struct page **pages, int nr)
> {
>	int i;
>
>	for (i = 0; i < nr; i++)
>		free_swap_cache(pages[i]);
> }
>
>
> static void tlb_flush_mmu_free(struct mmu_gather *tlb)
> {
>	struct mmu_gather_batch *batch;
>
>	for (batch = &tlb->local; batch && batch->nr;
>	     batch = batch->next) {
>		free_pages_and_swap_cache(batch->pages, batch->nr);
>	}
> }
>
> zap_pte_range()
> {
>	if (force_flush)
>		tlb_flush_mmu_free(tlb);
> }
>
> ... all the way up to the on-stack-allocated mmu_gather:
>
> void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>		      unsigned long size)
> {
>	struct mmu_gather tlb;
>	...
> }
>
> #define MMU_GATHER_BUNDLE 8
>
> struct mmu_gather {
>	...
>	struct mmu_gather_batch local;
>	struct page *__pages[MMU_GATHER_BUNDLE];
> };
>
> struct mmu_gather_batch {
>	struct mmu_gather_batch *next;
>	unsigned int nr;
>	unsigned int max;
>	struct page *pages[0];
> };
>
> #define MAX_GATHER_BATCH \
>	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
>
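
For my own understanding, plugging numbers into those definitions (a
userspace mock-up assuming 4K pages and 8-byte pointers, not the kernel
headers themselves):

#include <stdio.h>
#include <stddef.h>

struct page;	/* opaque here */

struct mmu_gather_batch {
	struct mmu_gather_batch *next;
	unsigned int nr;
	unsigned int max;
	struct page *pages[0];
};

#define PAGE_SIZE 4096UL
#define MAX_GATHER_BATCH \
	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))

int main(void)
{
	/* prints: pages[] at offset 16, 510 entries per batch */
	printf("pages[] at offset %zu, %lu entries per batch\n",
	       offsetof(struct mmu_gather_batch, pages), MAX_GATHER_BATCH);
	return 0;
}

So a full-page batch holds 510 page pointers, and an index of 48 is only
reachable in one of those dynamically allocated batches, never in
tlb->local with its 8-entry on-stack __pages[] bundle, which matches your
read of %R14.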



--
H.J.