Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX

From: Binbin Wu
Date: Wed Jul 30 2025 - 03:34:37 EST




On 7/28/2025 11:33 PM, Sean Christopherson wrote:
+Jiewen

Jiewen is out of the office until August 4th.


Summary, with the questions at the end.

Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
the TPM driver's ioremap (with UC) failing because the kernel has already mapped
the range using a cacheable mapping (WB).

ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
to WB. With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
actually honors the memtypes programmed into the virtual MTRRs.

It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
region (and potentially other regions) to be UC, so that the kernel ACPI driver's
attempts to map SystemMemory entries as cacheable get forced to UC. With MTRRs
forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
causes the ioremap infrastructure to reject the TPM driver's UC mapping.
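
For reference, the conflict can be reproduced in isolation with a throwaway
module along the lines of the sketch below (untested; TPM_TIS_BASE/TPM_TIS_SIZE
are illustrative defines, not kernel symbols, and the WB reservation is only
granted when the MTRRs allow it, i.e. the forced-WB case above):

/*
 * Minimal sketch: once a range has been reserved as WB via ioremap_cache()
 * (as the ACPI opregion mapping does), a later ioremap() (UC-) of the same
 * range is rejected by the PAT memtype tracking, which is what the TPM
 * driver's probe runs into.
 */
#include <linux/module.h>
#include <linux/io.h>

#define TPM_TIS_BASE	0xfed40000UL	/* illustrative, not a kernel symbol */
#define TPM_TIS_SIZE	0x5000UL

static void __iomem *wb_map;

static int __init memtype_conflict_demo_init(void)
{
	void __iomem *uc_map;

	wb_map = ioremap_cache(TPM_TIS_BASE, TPM_TIS_SIZE);	/* WB, like ACPI */
	if (!wb_map)
		return -ENOMEM;

	uc_map = ioremap(TPM_TIS_BASE, TPM_TIS_SIZE);		/* UC-, like the TPM driver */
	if (!uc_map)
		pr_info("UC ioremap refused, same failure as the TPM probe\n");
	else
		iounmap(uc_map);

	return 0;
}

static void __exit memtype_conflict_demo_exit(void)
{
	iounmap(wb_map);
}

module_init(memtype_conflict_demo_init);
module_exit(memtype_conflict_demo_exit);
MODULE_LICENSE("GPL");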

IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
EDK2's TPM ASL. And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]

//
// Operational region for TPM access
//
OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)

generates the problematic SystemMemory entry that triggers the ACPI driver's
auto-mapping logic.

QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
Memory32Fixed entry.

Presumably this is an EDK2 bug? If it's not an EDK2 bug, then how is the kernel's
ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
According to the ACPI spec 6.6, an operation region of SystemMemory has no
interface to specify the cacheable attribute.

One solution could be using MTRRs to communicate the memory attribute of the
legacy PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
that "long-term, firmware should not be using MTRRs to communicate anything to
the kernel." So this solution is not preferred.

If not MTRRs, there should be an alternative way to do the job.
1. ACPI table
   According to the ACPI spec, neither an operation region nor a 32-Bit Fixed
   Memory Range Descriptor can specify the cacheable attribute.
   "Address Space Resource Descriptors" can describe a memory range and can
   specify the cacheable attribute via "Type Specific Flags".
   One of the Address Space Resource Descriptors could be added to the ACPI
   table as a hint for when the kernel does the mapping for the operation
   region (see the sketch after this list).
   (There is also the "System Physical Address (SPA) Range Structure", which
   can specify the cacheable attribute, but it is intended for NVDIMMs.)
2. EFI memory map descriptor
   An EFI memory descriptor can specify the cacheable attribute. Firmware could
   add an EFI memory descriptor for the TPM TIS device as a hint for when the
   kernel does the mapping for the operation region.
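
For option 1, a rough sketch on the VMM side using QEMU's AML builder (purely
illustrative; whether the kernel's ACPI core would actually consult such a
descriptor when it maps the SystemMemory operation region is exactly the open
question, this is not current behavior):

/*
 * Sketch only, slotting into QEMU's build_tpm_tis_isa_aml() next to the
 * existing Memory32Fixed entry: describe the same range with an Address
 * Space Resource Descriptor (DWordMemory) whose Type Specific Flags mark
 * it as non-cacheable.
 */
aml_append(crs,
           aml_dword_memory(AML_POS_DECODE, AML_MIN_FIXED, AML_MAX_FIXED,
                            AML_NON_CACHEABLE, AML_READ_WRITE,
                            0x00000000,                  /* granularity  */
                            TPM_TIS_ADDR_BASE,           /* range min    */
                            TPM_TIS_ADDR_BASE +
                                TPM_TIS_ADDR_SIZE - 1,   /* range max    */
                            0x00000000,                  /* translation  */
                            TPM_TIS_ADDR_SIZE));         /* length       */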

An operation region of SystemMemory is still needed if an ACPI "Control Method"
needs to access a field, e.g., the _STA method. Checking another descriptor for
the cacheable attribute, either an "Address Space Resource Descriptor" or an
"EFI memory map descriptor", while the ACPI code does the mapping for an
operation region makes the code complicated.

Another thing is, if long-term firmware should not be using MTRRs to
communicate anything to the kernel, it seems safer to use ioremap() instead
of ioremap_cache() for MMIO resources when the kernel does the mapping for
operation region accesses?
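
To make that concrete, one possible shape of such a check (purely illustrative,
not existing kernel code; where it would actually live, e.g. around
acpi_os_map_iomem(), is an open question, but region_intersects() and
efi_mem_attributes() are existing kernel interfaces):

/*
 * Illustrative only: pick the mapping type for an ACPI SystemMemory
 * operation region based on what the range actually is, instead of
 * trusting the (virtual) MTRRs.
 */
#include <linux/efi.h>
#include <linux/io.h>
#include <linux/ioport.h>

static void __iomem *acpi_opregion_map(resource_size_t phys, size_t size)
{
	/* Real RAM can keep the cacheable mapping. */
	if (region_intersects(phys, size, IORESOURCE_SYSTEM_RAM,
			      IORES_DESC_NONE) == REGION_INTERSECTS)
		return ioremap_cache(phys, size);

	/* If the EFI memory map explicitly grants WB, honor it. */
	if (efi_enabled(EFI_MEMMAP) &&
	    (efi_mem_attributes(phys) & EFI_MEMORY_WB))
		return ioremap_cache(phys, size);

	/* Otherwise treat it as MMIO and map it UC. */
	return ioremap(phys, size);
}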


[1] https://github.com/tianocore/edk2/blob/master/SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl#L53
[2] https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03397.html

On Thu, Jul 24, 2025, Binbin Wu wrote:
On 7/23/2025 10:34 PM, Sean Christopherson wrote:
On Mon, Jul 14, 2025, Nikolay Borisov wrote:
On 14.07.25 г. 12:06 ч., Binbin Wu wrote:
On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
I tested this patch on top of commit 8e690b817e38, however we are
still experiencing the same failure.

I didn't reproduce the issue with QEMU.
After some comparison on how QEMU building the ACPI tables for HPET and
TPM,

- For HPET, the HPET range is added as Operation Region:
    aml_append(dev,
        aml_operation_region("HPTM", AML_SYSTEM_MEMORY,
aml_int(HPET_BASE),
                             HPET_LEN));

- For TPM, the range is added as 32-Bit Fixed Memory Range:
    if (TPM_IS_TIS_ISA(tpm_find())) {
        aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
                   TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
    }

So, in KVM, the code path for TPM is different from the trace for HPET in
the patch https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@xxxxxxxxxx/:
HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.
Argh, I was looking at the wrong TPM resource when poking through QEMU. I peeked
at TPM_PPI_ADDR_BASE, which gets an AML_SYSTEM_MEMORY entry, not TPM_TIS_ADDR_BASE.
...

I guess Google has defined an ACPI method to access the region for TPM TIS during
ACPI device probe.

In the meantime, can someone who has reproduced the real issue get backtraces to
confirm or disprove that acpi_os_map_iomem() is trying to map the TPM TIS range
as WB? E.g. with something like so:
Got confirmation off-list that Google's ACPI tables do trigger the kernel's
cacheable mapping logic for SYSTEM_MEMORY.

Mapping TPM TIS with req_type = 0
WARNING: CPU: 22 PID: 1 at arch/x86/mm/pat/memtype.c:530 memtype_reserve+0x2ab/0x460
Modules linked in:
CPU: 22 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.16.0-rc7+ #2 VOLUNTARY
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/29/2025
RIP: 0010:memtype_reserve+0x2ab/0x460
__ioremap_caller+0x16d/0x3d0
ioremap_cache+0x17/0x30
x86_acpi_os_ioremap+0xe/0x20
acpi_os_map_iomem+0x1f3/0x240
acpi_os_map_memory+0xe/0x20
acpi_ex_system_memory_space_handler+0x273/0x440
acpi_ev_address_space_dispatch+0x176/0x4c0
acpi_ex_access_region+0x2ad/0x530
acpi_ex_field_datum_io+0xa2/0x4f0
acpi_ex_extract_from_field+0x296/0x3e0
acpi_ex_read_data_from_field+0xd1/0x460
acpi_ex_resolve_node_to_value+0x2ee/0x530
acpi_ex_resolve_to_value+0x1f2/0x540
acpi_ds_evaluate_name_path+0x11b/0x190
acpi_ds_exec_end_op+0x456/0x960
acpi_ps_parse_loop+0x27a/0xa50
acpi_ps_parse_aml+0x226/0x600
acpi_ps_execute_method+0x172/0x3e0
acpi_ns_evaluate+0x175/0x5f0
acpi_evaluate_object+0x213/0x490
acpi_evaluate_integer+0x6d/0x140
acpi_bus_get_status+0x93/0x150
acpi_add_single_object+0x43a/0x7c0
acpi_bus_check_add+0x149/0x3a0
acpi_bus_check_add_1+0x16/0x30
acpi_ns_walk_namespace+0x22c/0x360
acpi_walk_namespace+0x15c/0x170
acpi_bus_scan+0x1dd/0x200
acpi_scan_init+0xe5/0x2b0
acpi_init+0x264/0x5b0
do_one_initcall+0x5a/0x310
kernel_init_freeable+0x34f/0x4f0
kernel_init+0x1b/0x200
ret_from_fork+0x186/0x1b0
ret_from_fork_asm+0x1a/0x30
</TASK>

I tried adding an AML_SYSTEM_MEMORY entry as an operation region in the ACPI
table and modifying the _STA method to access the region for TPM TIS in QEMU;
the issue can then be reproduced.

diff --git a/hw/tpm/tpm_tis_isa.c b/hw/tpm/tpm_tis_isa.c
index 876cb02cb5..aca2b2993f 100644
--- a/hw/tpm/tpm_tis_isa.c
+++ b/hw/tpm/tpm_tis_isa.c
@@ -143,6 +143,9 @@ static void build_tpm_tis_isa_aml(AcpiDevAmlIf *adev, Aml *scope)
     Aml *dev, *crs;
     TPMStateISA *isadev = TPM_TIS_ISA(adev);
     TPMIf *ti = TPM_IF(isadev);
+    Aml *field;
+    Aml *method;
+    Aml *test = aml_local(0);

     dev = aml_device("TPM");
     if (tpm_tis_isa_get_tpm_version(ti) == TPM_VERSION_2_0) {
@@ -152,7 +155,19 @@ static void build_tpm_tis_isa_aml(AcpiDevAmlIf *adev, Aml *scope)
         aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0C31")));
     }
     aml_append(dev, aml_name_decl("_UID", aml_int(1)));
-    aml_append(dev, aml_name_decl("_STA", aml_int(0xF)));
+
+    aml_append(dev, aml_operation_region("TPMM", AML_SYSTEM_MEMORY, aml_int(TPM_TIS_ADDR_BASE),
+                         TPM_TIS_ADDR_SIZE));
+
+    field = aml_field("TPMM", AML_DWORD_ACC, AML_LOCK, AML_PRESERVE);
+    aml_append(field, aml_named_field("TEST", 32));
+    aml_append(dev, field);
+
+    method = aml_method("_STA", 0, AML_NOTSERIALIZED);
+    aml_append(method, aml_store(aml_name("TEST"), test));
+    aml_append(method, aml_return(aml_int(0xF)));
+    aml_append(dev, method);