Re: [PATCH v2 mm-hotfixes 0/5] mm, arch: a more robust approach to sync top level kernel page tables
From: Harry Yoo
Date: Sun Jul 20 2025 - 19:59:03 EST
On Mon, Jul 21, 2025 at 08:41:58AM +0900, Harry Yoo wrote:
> RFC v1: https://lore.kernel.org/linux-mm/20250709131657.5660-1-harry.yoo@xxxxxxxxxx
>
> RFC v1 -> v2:
> - Dropped RFC tag.
> - Exposed page table sync code to common code (Mike Rapoport).
> - Used only one Fixes: tag in patch 3 instead of two,
>   to avoid confusion (Andrew Morton).
> - Reused the existing ARCH_PAGE_TABLE_SYNC_MASK and
>   arch_sync_kernel_mappings() facility (currently used by vmalloc and
>   ioremap) for page table sync instead of introducing a new interface.
>
> A quick question: technically, patches 4 and 5 don't necessarily need
> to be backported. Does it make sense to backport only patches 1-3?
>
> # The problem: It is easy to miss/overlook page table synchronization
>
> Hi all,
Looks like I forgot to Cc: Uladzislau and Joerg... adding them to Cc.
--
Cheers,
Harry / Hyeonggon
> During our internal testing, we started observing intermittent boot
> failures when the machine uses 4-level paging and has a large amount
> of persistent memory:
>
> BUG: unable to handle page fault for address: ffffe70000000034
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 0 P4D 0
> Oops: 0002 [#1] SMP NOPTI
> RIP: 0010:__init_single_page+0x9/0x6d
> Call Trace:
> <TASK>
> __init_zone_device_page+0x17/0x5d
> memmap_init_zone_device+0x154/0x1bb
> pagemap_range+0x2e0/0x40f
> memremap_pages+0x10b/0x2f0
> devm_memremap_pages+0x1e/0x60
> dev_dax_probe+0xce/0x2ec [device_dax]
> dax_bus_probe+0x6d/0xc9
> [... snip ...]
> </TASK>
>
> It turns out that the kernel panics while initializing vmemmap
> (struct page array) when the vmemmap region spans two PGD entries,
> because the new PGD entry is only installed in init_mm.pgd,
> but not in the page tables of other tasks.
>
> And looking at __populate_section_memmap():
>
>         if (vmemmap_can_optimize(altmap, pgmap))
>                 // does not sync top level page tables
>                 r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>         else
>                 // sync top level page tables in x86
>                 r = vmemmap_populate(start, end, nid, altmap);
>
> In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
> synchronizes the top level page table (See commit 9b861528a801
> ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping
> changes")) so that all tasks in the system can see the new vmemmap area.
>
> However, when vmemmap_can_optimize() returns true, the optimized path
> skips synchronization of top-level page tables. This is because
> vmemmap_populate_compound_pages() is implemented in core MM code, which
> does not handle synchronization of the top-level page tables. Instead,
> the core MM has historically relied on each architecture to perform this
> synchronization manually.
>
> We're not the first party to encounter a crash caused by unsynchronized
> top-level page tables: earlier this year, Gwan-gyeong Mun attempted to
> address the issue [1] [2] after hitting a kernel panic when x86 code
> accessed the vmemmap area before the corresponding top-level entries
> were synced. At that time, the issue was believed to be triggered
> only when struct page was enlarged for debugging purposes, and the
> patch did not get further updates.
>
> It turns out that the current approach of relying on each arch to
> handle the page table sync manually is fragile, because 1) it's easy to
> forget to sync the top level page table, and 2) it's also easy to
> overlook that the kernel should not access the vmemmap and direct
> mapping areas before the sync.
>
> # The solution: Make page table sync code more robust
>
> To address this, Dave Hansen suggested [3] [4] introducing
> {pgd,p4d}_populate_kernel() for updating the kernel portion
> of the page tables, allowing each architecture to explicitly perform
> synchronization when installing top-level entries. With this approach,
> we no longer need to worry about missing the sync step, reducing the
> risk of future regressions.
>
> The new interface reuses the existing ARCH_PAGE_TABLE_SYNC_MASK,
> PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by
> vmalloc and ioremap to synchronize page tables.
>
> pgd_populate_kernel() looks like this:
>
>         #define pgd_populate_kernel(addr, pgd, p4d)                     \
>         do {                                                            \
>                 pgd_populate(&init_mm, pgd, p4d);                       \
>                 if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)     \
>                         arch_sync_kernel_mappings(addr, addr);          \
>         } while (0)
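>
> On x86, the arch hooks it plugs into could look something like the
> sketch below (hypothetical; the actual definitions live in patch 3 and
> may differ). p4d_populate_kernel() is the analogous helper keyed on
> PGTBL_P4D_MODIFIED for 5-level paging.
>
>         /* new top-level kernel entries are PGDs (4-level) or P4Ds (5-level) */
>         #define ARCH_PAGE_TABLE_SYNC_MASK \
>                 (pgtable_l5_enabled() ? PGTBL_P4D_MODIFIED : PGTBL_PGD_MODIFIED)
>
>         void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
>         {
>                 sync_global_pgds(start, end);
>         }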
>
> It is worth noting that vmalloc() and apply_to_range() carefully
> synchronize page tables by calling p*d_alloc_track() and
> arch_sync_kernel_mappings(), and thus they are not affected by
> this patch series.
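>
> That pattern looks roughly like this (a simplified sketch of
> vmap_range_noflush() in mm/vmalloc.c): each p*d_alloc_track() call sets
> a PGTBL_*_MODIFIED bit in a mask when it installs a new page table
> page, and the caller syncs once at the end.
>
>         pgd_t *pgd;
>         unsigned long start = addr, next;
>         pgtbl_mod_mask mask = 0;
>         int err;
>
>         pgd = pgd_offset_k(addr);
>         do {
>                 next = pgd_addr_end(addr, end);
>                 /* allocates lower-level tables, records changes in mask */
>                 err = vmap_p4d_range(pgd, addr, next, phys_addr, prot,
>                                      max_page_shift, &mask);
>                 if (err)
>                         break;
>         } while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
>
>         if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>                 arch_sync_kernel_mappings(start, end);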
>
> This patch series was hugely inspired by Dave Hansen's suggestion and
> hence added Suggested-by: Dave Hansen.
>
> Cc stable because the lack of this series opens the door to
> intermittent boot failures.
>
> [1] https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@xxxxxxxxx
> [2] https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@xxxxxxxxx
> [3] https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@xxxxxxxxx
> [4] https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@xxxxxxxxx
>
> Harry Yoo (5):
> mm: move page table sync declarations to asm/pgalloc.h
> mm: introduce and use {pgd,p4d}_populate_kernel()
> x86/mm: define ARCH_PAGE_TABLE_SYNC_MASK and
> arch_sync_kernel_mappings()
> x86/mm: convert p*d_populate{,_init} to _kernel variants
> x86/mm: drop unnecessary calls to sync_global_pgds() and fold into its
> sole user
>
> arch/x86/include/asm/pgalloc.h | 22 ++++++++++++++++++++
> arch/x86/mm/init_64.c | 37 +++++++++++++++++++---------------
> arch/x86/mm/kasan_init_64.c | 8 ++++----
> include/asm-generic/pgalloc.h | 30 +++++++++++++++++++++++++++
> include/linux/vmalloc.h | 16 ---------------
> mm/kasan/init.c | 10 ++++-----
> mm/percpu.c | 4 ++--
> mm/sparse-vmemmap.c | 4 ++--
> mm/vmalloc.c | 1 +
> 9 files changed, 87 insertions(+), 45 deletions(-)
>
> --
> 2.43.0
>