Re: AMD boot woe due to "x86/mm: Cleanup pgprot_4k_2_large() and pgprot_large_2_4k()"

From: Qian Cai
Date: Wed Apr 22 2020 - 14:35:31 EST




> On Apr 22, 2020, at 1:01 PM, Christoph Hellwig <hch@xxxxxx> wrote:
>
> On Wed, Apr 22, 2020 at 11:55:54AM -0400, Qian Cai wrote:
>> Reverted the linux-next commit and its dependency,
>>
>> a85573f7e741 ("x86/mm: Unexport __cachemode2pte_tbl")
>> 9e294786c89a ("x86/mm: Cleanup pgprot_4k_2_large() and pgprot_large_2_4k()")
>>
>> fixed crashes or hard reset on AMD machines during boot that have been flagged by
>> KASAN in different forms indicating some sort of memory corruption with this config,
>
> Interesting. Your config seems to boot fine in my VM until the point
> where the lack of virtio-blk support stops it from mounting the root
> file system.
>
> Looking at the patch I found one bug, although that should not affect
> your config (it should use the pgprotval_t type), and one difference
> that could affect code generation, although I prefer the new version
> (use of __pgprot vs a local variable + pgprot_val()).
>
> Two patches attached, can you try them?
> <0001-x86-Use-pgprotval_t-in-protval_4k_2_large-and-pgprot.patch><0002-foo.patch>

Yes, but neither patch helps here. This time the failure is flagged by UBSAN:

static void dump_pagetable(unsigned long address)
{
	pgd_t *base = __va(read_cr3_pa());
	pgd_t *pgd = base + pgd_index(address); <-- shift-out-of-bounds here

[ 4.452663][ T0] ACPI: LAPIC_NMI (acpi_id[0x73] high level lint[0x1])
[ 4.459391][ T0] ACPI: LAPIC_NMI (acpi_id[0x74] high level lint[0x1])
[ 4.466115][ T0] ACPI: LAPIC_NMI (acpi_id[0x75] high level lint[0x1])
[ 4.472842][ T0] ACPI: LAPIC_NMI (acpi_id[0x76] high level lint[0x1])
[ 4.479567][ T0] ACPI: LAPIC_NMI (acpi_id[0x77] high level lint[0x1])
[ 4.486294][ T0] ACPI: LAPIC_NMI (acpi_id[0x78] high level lint[0x1])
[ 4.493021][ T0] ACPI: LAPIC_NMI (acpi_id[0x79] high level lint[0x1])
[ 4.499745][ T0] ACPI: LAPIC_NMI (acpi_id[0x7a] high level lint[0x1])
[ 4.506471][ T0] ACPI: LAPIC_NMI (acpi_id[0x7b] high level liad access in kernel mode
[ 4.901030][ T0] #PF: error_code(0x0000) - not-present page
[ 4.906884][ T0] BUG: unable to handle page fault for address: ffffed11509c29da
[ 4.914483][ T0] #PF: supervisor read access in kernel mode
[ 4.920334][ T0] #PF: error_code(0x0000) - not-present page
[ 4.926189][ T0] BUG: unable to handle page fault for address: ffffed11509c29da
[ 4.933786][ T0] #PF: supervisor read access in kernel mode
[ 4.939640][ T0] #PF: error_code(0x0000) - not-present page
[ 4.945492][ T0] BUG: unable to handle page fault for address: ffffed11509c29da
[ 4.953091][ T0] #PF: supervisor read access in kernel mode
[ 4.958943][ T0] #PF: error_code(0x0000) - not-present page
[ 4.964797][ T0] BUG: unable to handle page fault for address: ffffed11509c29da
[ 4.972395][ T0] #PF: supervisor read access in kernel mode
[ 4.978247][ T0] #PF: error_code(0x0000) - not-present page
[ 4.984102][ T0] BUG: unable to handle page fault for address: ffffed11509c29da
[ 4.9917age fault for address: ffffed11509c29da
[ 5.481007][ T0] #PF: supervisor read access in kernel mode
[ 5.486862][ T0] #PF: error_code(0x0000) - not-present page
[ 5.492713][ T0] BUG: unable to handle page fault for address: ffffed11509c29da
[ 5.500314][ T0] #PF: supervisor read access in kernel mode
[ 5.506165][ T0] #PF: error_code(0x0000) - not-present page
[ 5.512020][ T0] ================================================================================
[ 5.521193][ T0] UBSAN: shift-out-of-bounds in arch/x86/mm/fault.c:450:22
[ 5.528268][ T0] shift exponent 4294967295 is too large for 64-bit type 'long unsigned int'
[ 5.536916][ T0] CPU: 0 PID: 0 Comm: swapper Tainted: G B 5.7.0-rc2-next-20200422+ #10
[ 5.546434][ T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[ 5.555692][ T0] Call Trace:
[ 5.558837][ T0] ================================================================================
[ 5.568012][T0] BUG: unable to handle page fault for address: 0000000a2b84dda8
[ 5.961699][ T0] #PF: supervisor read access in kernel mode
[ 5.967550][ T0] #PF: error_code(0x0000) - not-present page
[ 5.973405][ T0] BUG: unable to handle page fault for address: 0000000a2b84dda8
[ 5.981005][ T0] #PF: supervisor read access in kernel mode
[ 5.986856][ T0] #PF: error_code(0x0000) - not-present page
[ 5.992708][ T0] BUG: unable to handle page fault for address: 0000000a2b84dda8
[ 6.000308][ T0] #PF: supervisor read access in kernel mode
[ 6.006159][ T0] #PF: error_code(0x0000) - not-present page
[ 6.012013][ T0] BUG: unable to handle page fault for address: 0000000a2b84dda8
[ 6.019612][ T0] #PF: supervisor read access in kernel mode
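
For reference, the reported shift exponent 4294967295 is what -1 becomes when read as a 32-bit unsigned value, and shifting a 64-bit operand by 64 or more is undefined behavior in C. Below is a minimal userspace sketch of that condition, assuming pgd_index() has the usual (address >> shift) & (PTRS_PER_PGD - 1) shape. The names sketch_pgd_index() and SKETCH_PTRS_PER_PGD and the shift value 39 are hypothetical stand-ins, not the kernel's code, and the sketch does not explain why the exponent ended up invalid on this machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the number of top-level entries (assumed 512). */
#define SKETCH_PTRS_PER_PGD 512UL

/*
 * Sketch of a top-level page-table index computation with an explicit
 * range check on the shift exponent. Returns false instead of shifting
 * when the exponent is not valid for a 64-bit operand.
 */
static bool sketch_pgd_index(uint64_t address, unsigned int shift,
			     uint64_t *index)
{
	assert(index != NULL);

	if (shift >= 64)
		return false;	/* the shift would be undefined behavior */

	*index = (address >> shift) & (SKETCH_PTRS_PER_PGD - 1);
	return true;
}
```

With an assumed shift of 39, the faulting address ffffed11509c29da maps to index 474, while an exponent of (unsigned int)-1, the value UBSAN printed, is rejected by the range check.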