Re: KVM guest sometimes failed to boot because of kernel stack overflow if KPTI is enabled on a hisilicon ARM64 platform.

From: Wei Xu
Date: Fri Jun 22 2018 - 06:46:22 EST


Hi Will,

On 2018/6/22 17:23, Will Deacon wrote:
Hi Wei,

On Fri, Jun 22, 2018 at 09:33:04AM +0100, Wei Xu wrote:
On 2018/6/21 11:54, Will Deacon wrote:
On Thu, Jun 21, 2018 at 11:14:28AM +0100, Wei Xu wrote:
On 2018/6/21 10:18, Will Deacon wrote:
Wei -- does the diff below help at all? Make sure you disable CONFIG_KASAN,
otherwise your kernel will take an age to boot.
Yes, amazing! This patch resolved the issue.
Great...

I have tested it 50 times and cannot reproduce the issue any more.
Could you please explain why this patch works?
You might need to ask your CPU design team ;)

Without this patch, the code in idmap_kpti_install_ng_mappings() sets
bit 11 in table descriptors so that we can keep track of which parts of
the page table we've visited. With this patch, we don't bother tracking
and potentially rewalk parts of the page table (which takes a very long
time if KASAN is enabled).
Got it. Thanks!
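
For readers following the thread, the "visited" marking described above can be modelled roughly as below. This is only an illustrative sketch in Python, not the kernel's assembly: the names VALID, TABLE, BIT11 and visit() are made up for the example, and the descriptor value merely borrows the pgd value from the log further down.

```python
# Simplified model of marking table descriptors as "visited" via bit 11.
# Assumption: bit 0 = valid, bit 1 = table (non-leaf), and bit 11 is nG in
# block/page descriptors but IGNORED by hardware in table descriptors.
VALID = 1 << 0
TABLE = 1 << 1
BIT11 = 1 << 11


def visit(table, index):
    """Return True if the entry is processed now, False if it is skipped."""
    desc = table[index]
    if not desc & VALID:
        return False          # invalid entries are skipped
    if desc & BIT11:
        return False          # already marked, so do not walk it again
    table[index] = desc | BIT11  # write the entry back with the marker set
    return True


# Hypothetical one-entry table; 0x411f8003 is the pgd value seen in the log.
pgd = [0x411f8000 | TABLE | VALID, 0]
first = visit(pgd, 0)
second = visit(pgd, 0)
print(first, second, hex(pgd[0]))
```

Without the marker (the behaviour with the first patch), a second call would simply process the same entry again, which is harmless but slow on large page tables.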

The architecture documents I've looked at are clear that bit 11 is IGNORED
by the CPU, which:

"Indicates that the architecture guarantees that the bit or field is not
interpreted or modified by hardware."

Please can you double-check that your CPU is indeed ignoring bit 11 in
non-leaf (table) descriptors?
By non-leaf (table) descriptors, do you mean the table descriptors
in section D4.3.1 "VMSAv8-64 translation table level 0, level 1, and level 2 descriptor formats"
of the ARM Architecture Reference Manual ARMv8, for ARMv8-A (DDI0487C_a_armv8_arm.pdf)?

If so, our hardware does ignore it (it neither interprets nor modifies the bit).
Ok, thanks for checking.

Is there any other possible cause for this?
Perhaps just writing back the table entries is enough to cause the issue,
although I really can't understand why that would be the case. Can you try
the diff below (without my previous change), please?

Thanks!
But it does not resolve the issue (I applied only this patch, on top of 4.17.0).
The log is below:

estuary:/$ ./qemu-system-aarch64 -machine virt,kernel_irqchip=on,gic-version=3
-cpu host -enable-kvm -smp 1 -m 1024 -kernel ./Image-4.17-joyx -initrd
../mini-rootfs-arm64.cpio.gz -nographic -append "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000"
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x480fd010]
[ 0.000000] Linux version 4.17.0-45865-gc58dc48 (joyx@Turing-Arch-b) (gcc version 4.9.1 20140505 (prerelease) (crosstool-NG linaro-1.13.1-4.9-2014.05 - Linaro GCC 4.9-2014.05)) #14 SMP PREEMPT Fri Jun 22 18:26:01 CST 2018
[ 0.000000] Machine model: linux,dummy-virt
[ 0.000000] earlycon: pl11 at MMIO 0x0000000009000000 (options '')
[ 0.000000] bootconsole [pl11] enabled
[ 0.000000] efi: Getting EFI parameters from FDT:
[ 0.000000] efi: UEFI not found.
[ 0.000000] cma: Reserved 16 MiB at 0x000000007f000000
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: Faking a node at [mem 0x0000000000000000-0x000000007fffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x7efeb300-0x7efecdff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA32 [mem 0x0000000040000000-0x000000007fffffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000040000000-0x000000007fffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000040000000-0x000000007fffffff]
[ 0.000000] psci: probing for conduit method from DT.
[ 0.000000] psci: PSCIv1.0 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: Trusted OS migration not required
[ 0.000000] psci: SMC Calling Convention v1.1
[ 0.000000] random: get_random_bytes called from start_kernel+0xa8/0x418 with crng_init=0
[ 0.000000] percpu: Embedded 24 pages/cpu @ (ptrval) s57984 r8192 d32128 u98304
[ 0.000000] Detected VIPT I-cache on CPU0
[ 0.000000] CPU features: detected: Kernel page table isolation (KPTI)
[ 0.000000] CPU features: detected: Hardware dirty bit management
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 258048
[ 0.000000] Policy zone: DMA32
[ 0.000000] Kernel command line: rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000
[ 0.000000] Memory: 968436K/1048576K available (10044K kernel code, 1328K rwdata, 4840K rodata, 1216K init, 409K bss, 63756K reserved, 16384K cma-reserved)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] Preemptible hierarchical RCU implementation.
[ 0.000000] RCU restricting CPUs from NR_CPUS=128 to nr_cpu_ids=1.
[ 0.000000] Tasks RCU enabled.
[ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[ 0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[ 0.000000] GICv3: Distributor has no Range Selector support
[ 0.000000] GICv3: no VLPI support, no direct LPI support
[ 0.000000] ITS [mem 0x08080000-0x0809ffff]
[ 0.000000] ITS@0x0000000008080000: allocated 8192 Devices @7d830000 (indirect, esz 8, psz 64K, shr 1)
[ 0.000000] ITS@0x0000000008080000: allocated 8192 Interrupt Collections @7d840000 (flat, esz 8, psz 64K, shr 1)
[ 0.000000] GIC: using LPI property table @0x000000007d850000
[ 0.000000] ITS: Allocated 1792 chunks for LPIs
[ 0.000000] GICv3: CPU0: found redistributor 0 region 0:0x00000000080a0000
[ 0.000000] CPU0: using LPI pending table @0x000000007d860000
[ 0.000000] GIC: PPI11 is secure or misconfigured
[ 0.000000] arch_timer: WARNING: Invalid trigger for IRQ3, assuming level low
[ 0.000000] arch_timer: WARNING: Please fix your firmware
[ 0.000000] arch_timer: cp15 timer(s) running at 100.00MHz (virt).
[ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x171024e7e0, max_idle_ns: 440795205315 ns
[ 0.000002] sched_clock: 56 bits at 100MHz, resolution 10ns, wraps every 4398046511100ns
[ 0.000844] Console: colour dummy device 80x25
[ 0.001406] Calibrating delay loop (skipped), value calculated using timer frequency.. 200.00 BogoMIPS (lpj=400000)
[ 0.002458] pid_max: default: 32768 minimum: 301
[ 0.002944] Security Framework initialized
[ 0.003521] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
[ 0.004322] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
[ 0.005022] Mount-cache hash table entries: 2048 (order: 2, 16384 bytes)
[ 0.005797] Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes)
[ 0.025904] ASID allocator initialised with 32768 entries
[ 0.029913] Hierarchical SRCU implementation.
[ 0.034285] Platform MSI: its domain created
[ 0.034740] PCI/MSI: /intc/its domain created
[ 0.035318] EFI services will not be available.
[ 0.037943] smp: Bringing up secondary CPUs ...
[ 0.038410] smp: Brought up 1 node, 1 CPU
[ 0.038815] SMP: Total of 1 processors activated.
[ 0.039300] CPU features: detected: GIC system register CPU interface
[ 0.039946] CPU features: detected: Privileged Access Never
[ 0.040506] CPU features: detected: User Access Override
[ 0.042439] Insufficient stack space to handle exception!
[ 0.042441] ESR: 0x96000046 -- DABT (current EL)
[ 0.043752] FAR: 0xffff0000093a80e0
[ 0.044207] Task stack: [0xffff0000093a8000..0xffff0000093ac000]
[ 0.046511] IRQ stack: [0xffff000008000000..0xffff000008004000]
[ 0.052899] Overflow stack: [0xffff80003efce2f0..0xffff80003efcf2f0]
[ 0.059396] CPU: 0 PID: 12 Comm: migration/0 Not tainted 4.17.0-45865-gc58dc48 #14
[ 0.067018] Hardware name: linux,dummy-virt (DT)
[ 0.071710] pstate: 604003c5 (nZCv DAIF +PAN -UAO)
[ 0.076532] pc : el1_sync+0x0/0xb0
[ 0.080028] lr : kpti_install_ng_mappings+0x120/0x214
[ 0.085197] sp : ffff0000093a80e0
[ 0.088566] x29: ffff0000093abce0 x28: ffff000008ea9000
[ 0.093979] x27: ffff000008ea9000 x26: ffff0000091f7000
[ 0.099293] x25: ffff00000906d000 x24: ffff000009191000
[ 0.104706] x23: ffff000008ea9000 x22: 0000000041190000
[ 0.110015] x21: ffff0000091f7000 x20: 0000000000000000
[ 0.115428] x19: ffff000009190000 x18: 000000003455d99d
[ 0.120842] x17: 0000000000000001 x16: 00f8000040ffff13
[ 0.126255] x15: 000000007eff6000 x14: 000000007eff6000
[ 0.131566] x13: 00f800007fe00f11 x12: 000000007eff8000
[ 0.136983] x11: 000000007eff8000 x10: 0000000000000000
[ 0.142396] x9 : 000000007eff9000 x8 : 000000007eff9000
[ 0.147704] x7 : 0000000000000000 x6 : 00000000411f8000
[ 0.153116] x5 : 00000000411f8000 x4 : 0000000040a443d4
[ 0.158530] x3 : 00000000411f7000 x2 : 00000000411f7000
[ 0.163943] x1 : ffff00000906d7b0 x0 : ffff80003da61c00
[ 0.169251] Kernel panic - not syncing: kernel stack overflow
[ 0.175140] CPU: 0 PID: 12 Comm: migration/0 Not tainted 4.17.0-45865-gc58dc48 #14
[ 0.182732] Hardware name: linux,dummy-virt (DT)
[ 0.187424] Call trace:
[ 0.189948] dump_backtrace+0x0/0x180
[ 0.193678] show_stack+0x14/0x1c
[ 0.197051] dump_stack+0x90/0xb0
[ 0.200423] panic+0x138/0x2a0
[ 0.203549] __stack_chk_fail+0x0/0x18
[ 0.207398] handle_bad_stack+0x118/0x124
[ 0.211489] __bad_stack+0x88/0x8c
[ 0.214870] el1_sync+0x0/0xb0
[ 0.217998] Unable to handle kernel paging request at virtual address ffff0000093abce0
[ 0.226061] Mem abort info:
[ 0.228839] ESR = 0x96000006
[ 0.231965] Exception class = DABT (current EL), IL = 32 bits
[ 0.237980] SET = 0, FnV = 0
[ 0.241105] EA = 0, S1PTW = 0
[ 0.244346] Data abort info:
[ 0.247239] ISV = 0, ISS = 0x00000006
[ 0.251199] CM = 0, WnR = 0
[ 0.254209] swapper pgtable: 4k pages, 48-bit VAs, pgdp = (ptrval)
[ 0.261191] [ffff0000093abce0] pgd=00000000411f8003, pud=00000000411f9003, pmd=0000000000000000
[ 0.269982] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[ 0.275538] Modules linked in:
[ 0.278664] CPU: 0 PID: 12 Comm: migration/0 Not tainted 4.17.0-45865-gc58dc48 #14
[ 0.286361] Hardware name: linux,dummy-virt (DT)
[ 0.291053] pstate: 204003c5 (nzCv DAIF +PAN -UAO)
[ 0.295874] pc : unwind_frame+0x28/0xc8
[ 0.299836] lr : dump_backtrace+0x12c/0x180
[ 0.304055] sp : ffff80003efcf000
[ 0.307429] x29: ffff80003efcf000 x28: ffff80003da61c00
[ 0.312841] x27: ffff000008ea9000 x26: ffff0000091f7000
[ 0.318255] x25: ffff00000906d000 x24: ffff0000093a80e0
[ 0.323563] x23: 0000000000000000 x22: ffff000008dbada0
[ 0.328975] x21: 0000000000000000 x20: ffff000009049000
[ 0.334388] x19: ffff80003da61c00 x18: 000000003455d99d
[ 0.339698] x17: 0000000000000001 x16: 00f8000040ffff13
[ 0.345111] x15: 000000007eff6000 x14: 3431232038346364
[ 0.350523] x13: 0000000000000000 x12: cc26f77952f87e00
[ 0.355832] x11: ffffffffffffffff x10: 0000000000000075
[ 0.361245] x9 : ffff0000085ae9e8 x8 : 78302f3078302b63
[ 0.366666] x7 : 6e79735f316c6520 x6 : ffff0000091befe1
[ 0.371976] x5 : 0000000000000000 x4 : ffff0000093ac000
[ 0.377389] x3 : ffff0000093a8000 x2 : ffff0000093abce0
[ 0.382801] x1 : ffff80003efcf048 x0 : ffff80003da61c00
[ 0.388214] Process migration/0 (pid: 12, stack limit = 0x (ptrval))
[ 0.395204] Call trace:
[ 0.397726] unwind_frame+0x28/0xc8
[ 0.401224] show_stack+0x14/0x1c
[ 0.404699] dump_stack+0x90/0xb0
[ 0.408070] panic+0x138/0x2a0
[ 0.411198] __stack_chk_fail+0x0/0x18
[ 0.414944] handle_bad_stack+0x118/0x124
[ 0.419035] __bad_stack+0x88/0x8c
[ 0.422520] el1_sync+0x0/0xb0
[ 0.425648] Unable to handle kernel paging request at virtual address ffff0000093abce0
[ 0.433601] Mem abort info:
[ 0.436486] ESR = 0x96000006
[ 0.439611] Exception class = DABT (current EL), IL = 32 bits
[ 0.445626] SET = 0, FnV = 0
[ 0.448754] EA = 0, S1PTW = 0
[ 0.451995] Data abort info:
[ 0.454888] ISV = 0, ISS = 0x00000006
[ 0.458849] CM = 0, WnR = 0
[ 0.461860] swapper pgtable: 4k pages, 48-bit VAs, pgdp = (ptrval)
[ 0.468843] [ffff0000093abce0] pgd=00000000411f8003, pud=00000000411f9003, pmd=0000000000000000
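
As a reading aid for the log above, the ESR values and the faulting address can be decoded by hand. The sketch below is illustrative only: the helper names decode_esr() and va_indices() are made up here, and the field positions assume the AArch64 ESR_ELx layout (EC in bits 31:26, IL in bit 25, WnR in bit 6, DFSC in bits 5:0) and the 4k-page, 48-bit VA configuration that the log reports.

```python
def decode_esr(esr):
    """Split an ESR_ELx value for a data abort into its main fields."""
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class
        "il": (esr >> 25) & 0x1,    # 1 = 32-bit instruction
        "wnr": (esr >> 6) & 0x1,    # 1 = fault caused by a write
        "dfsc": esr & 0x3F,         # data fault status code
    }


def va_indices(va):
    """Table indices for a 4k-granule, 48-bit VA (9 bits per level)."""
    return {
        "pgd": (va >> 39) & 0x1FF,
        "pud": (va >> 30) & 0x1FF,
        "pmd": (va >> 21) & 0x1FF,
        "pte": (va >> 12) & 0x1FF,
    }


first_fault = decode_esr(0x96000046)   # "Insufficient stack space" report
later_fault = decode_esr(0x96000006)   # "Unable to handle kernel paging request"
indices = va_indices(0xFFFF0000093ABCE0)
print(first_fault, later_fault, indices)
```

Both values decode to EC 0x25 (data abort taken from the current EL) with DFSC 0x06 (level 2 translation fault), which agrees with the pmd=0000000000000000 shown in the page table dump. The first one has WnR set, so it was a write, and the second was a read.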


Will

--->8

diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 5f9a73a4452c..e2a8e88f95a0 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -216,7 +216,7 @@ ENDPROC(idmap_cpu_replace_ttbr1)
.endm
.macro __idmap_kpti_put_pgtable_ent_ng, type
- orr \type, \type, #PTE_NG // Same bit for blocks and pages
+ eor \type, \type, #PTE_NG // Same bit for blocks and pages
str \type, [cur_\()\type\()p] // Update the entry and ensure it
dc civac, cur_\()\type\()p // is visible to all CPUs.
.endm
@@ -298,6 +298,7 @@ skip_pgd:
/* PUD */
walk_puds:
.if CONFIG_PGTABLE_LEVELS > 3
+ eor pgd, pgd, #PTE_NG
pte_to_phys cur_pudp, pgd
add end_pudp, cur_pudp, #(PTRS_PER_PUD * 8)
do_pud: __idmap_kpti_get_pgtable_ent pud
@@ -319,6 +320,7 @@ next_pud:
/* PMD */
walk_pmds:
.if CONFIG_PGTABLE_LEVELS > 2
+ eor pud, pud, #PTE_NG
pte_to_phys cur_pmdp, pud
add end_pmdp, cur_pmdp, #(PTRS_PER_PMD * 8)
do_pmd: __idmap_kpti_get_pgtable_ent pmd
@@ -339,6 +341,7 @@ next_pmd:
/* PTE */
walk_ptes:
+ eor pmd, pmd, #PTE_NG
pte_to_phys cur_ptep, pmd
add end_ptep, cur_ptep, #(PTRS_PER_PTE * 8)
do_pte: __idmap_kpti_get_pgtable_ent pte
