Re: [PATCH 4.15 049/146] mm/page_alloc: fix memmap_init_zone pageblock alignment

From: Daniel Vacek
Date: Tue Mar 13 2018 - 17:48:11 EST


On Tue, Mar 13, 2018 at 8:34 PM, Dan Rue <dan.rue@xxxxxxxxxx> wrote:
> On Tue, Mar 13, 2018 at 04:23:36PM +0100, Greg Kroah-Hartman wrote:
>> 4.15-stable review patch. If anyone has any objections, please let me know.
>
> On 4.14 and 4.15, this patch breaks booting on dragonboard 410c and
> hikey 620 (both arm64). The fix has been proposed and tested but is not
> yet in mainline per https://lkml.org/lkml/2018/3/12/710

I'll send a formal fix today.

> Dan
>
>>
>> ------------------
>>
>> From: Daniel Vacek <neelx@xxxxxxxxxx>
>>
>> commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae upstream.
>>
>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>> where possible") introduced a bug where move_freepages() triggers a
>> VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
>> To fix this, simply align the skipped pfns in memmap_init_zone() the
>> same way as in move_freepages_block().
>>
>> Seen in one of the RHEL reports:
>>
>> crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
>> kernel BUG at mm/page_alloc.c:1389!
>> invalid opcode: 0000 [#1] SMP
>> --
>> RIP: 0010:[<ffffffff8118833e>] [<ffffffff8118833e>] move_freepages+0x15e/0x160
>> RSP: 0018:ffff88054d727688 EFLAGS: 00010087
>> --
>> Call Trace:
>> [<ffffffff811883b3>] move_freepages_block+0x73/0x80
>> [<ffffffff81189e63>] __rmqueue+0x263/0x460
>> [<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0
>> [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
>> --
>> RIP [<ffffffff8118833e>] move_freepages+0x15e/0x160
>> RSP <ffff88054d727688>
>>
>> crash> page_init_bug -v | grep RAM
>> <struct resource 0xffff88067fffd2f8> 1000 - 9bfff System RAM (620.00 KiB)
>> <struct resource 0xffff88067fffd3a0> 100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
>> <struct resource 0xffff88067fffd410> 4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
>> <struct resource 0xffff88067fffd480> 4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
>> <struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB)
>> <struct resource 0xffff88067fffd640> 100000000 - 67fffffff System RAM ( 22.00 GiB)
>>
>> crash> page_init_bug | head -6
>> <struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB)
>> <struct page 0xffffea0001ede200> 1fffff00000000 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
>> <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
>> <struct page 0xffffea0001ed8000> 0 0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA 1 4095
>> <struct page 0xffffea0001edffc0> 1fffff00000400 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
>> BUG, zones differ!
>>
>> Note that this range follows two not populated sections
>> 68000000-77ffffff in this zone. 7b788000-7b7fffff is the first one
>> after a gap. This makes memmap_init_zone() skip all the pfns up to the
>> beginning of this range. But this range is not pageblock (2M) aligned.
>> In fact no range has to be.
>>
>> crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
>> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
>> ffffea0001e00000 78000000 0 0 0 0
>> ffffea0001ed7fc0 7b5ff000 0 0 0 0
>> ffffea0001ed8000 7b600000 0 0 0 0 <<<<
>> ffffea0001ede1c0 7b787000 0 0 0 0
>> ffffea0001ede200 7b788000 0 0 1 1fffff00000000
>>
>> Top part of page flags should contain nodeid and zonenr, which is not
>> the case for page ffffea0001ed8000 here (<<<<).
>>
>> crash> log | grep -o fffea0001ed[^\ ]* | sort -u
>> fffea0001ed8000
>> fffea0001eded20
>> fffea0001edffc0
>>
>> crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
>> fffea0001ed8000
>> fffea0001eded00
>> fffea0001eded20
>> fffea0001edffc0
>>
>> Initialization of the whole beginning of the section is skipped up to
>> the start of the range due to the commit b92df1de5d28. Now any code
>> calling move_freepages_block() (like reusing the page from a freelist as
>> in this example) with a page from the beginning of the range will get
>> the page rounded down to start_page ffffea0001ed8000 and passed to
>> move_freepages() which crashes on assertion getting wrong zonenr.
>>
>> > VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
>>
>> Note, page_zone() derives the zone from page flags here.
>>
>> >From similar machine before commit b92df1de5d28:
>>
>> crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
>> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
>> fffff73941e00000 78000000 0 0 1 1fffff00000000
>> fffff73941ed7fc0 7b5ff000 0 0 1 1fffff00000000
>> fffff73941ed8000 7b600000 0 0 1 1fffff00000000
>> fffff73941edff80 7b7fe000 0 0 1 1fffff00000000
>> fffff73941edffc0 7b7ff000 ffff8e67e04d3ae0 ad84 1 1fffff00020068 uptodate,lru,active,mappedtodisk
>>
>> All the pages since the beginning of the section are initialized.
>> move_freepages()' not gonna blow up.
>>
>> The same machine with this fix applied:
>>
>> crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
>> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
>> ffffea0001e00000 78000000 0 0 0 0
>> ffffea0001e00000 7b5ff000 0 0 0 0
>> ffffea0001ed8000 7b600000 0 0 1 1fffff00000000
>> ffffea0001edff80 7b7fe000 0 0 1 1fffff00000000
>> ffffea0001edffc0 7b7ff000 ffff88017fb13720 8 2 1fffff00020068 uptodate,lru,active,mappedtodisk
>>
>> At least the bare minimum of pages is initialized preventing the crash
>> as well.
>>
>> Customers started to report this as soon as 7.4 (where b92df1de5d28 was
>> merged in RHEL) was released. I remember reports from
>> September/October-ish times. It's not easily reproduced and happens on
>> a handful of machines only. I guess that's why. But that does not make
>> it less serious, I think.
>>
>> Though there actually is a report here:
>> https://bugzilla.kernel.org/show_bug.cgi?id=196443
>>
>> And there are reports for Fedora from July:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1473242
>> and CentOS:
>> https://bugs.centos.org/view.php?id=13964
>> and we internally track several dozens reports for RHEL bug
>> https://bugzilla.redhat.com/show_bug.cgi?id=1525121
>>
>> Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@xxxxxxxxxx
>> Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
>> Signed-off-by: Daniel Vacek <neelx@xxxxxxxxxx>
>> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxxx>
>> Cc: Paul Burton <paul.burton@xxxxxxxxxx>
>> Cc: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
>> Cc: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>>
>> ---
>> mm/page_alloc.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5353,9 +5353,14 @@ void __meminit memmap_init_zone(unsigned
>> /*
>> * Skip to the pfn preceding the next valid one (or
>> * end_pfn), such that we hit a valid pfn (or end_pfn)
>> - * on our next iteration of the loop.
>> + * on our next iteration of the loop. Note that it needs
>> + * to be pageblock aligned even when the region itself
>> + * is not. move_freepages_block() can shift ahead of
>> + * the valid region but still depends on correct page
>> + * metadata.
>> */
>> - pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
>> + pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
>> + ~(pageblock_nr_pages-1)) - 1;
>> #endif
>> continue;
>> }
>>
>>