Re: [PATCH] x86/resctrl: avoid divide by 0 num_rmid

From: Barret Rhoden
Date: Tue Jul 22 2025 - 14:51:48 EST


On 7/22/25 2:19 PM, Reinette Chatre wrote:
> Hi Barret,
>
> On 7/21/25 11:00 AM, Barret Rhoden wrote:
>> x86_cache_max_rmid's default is -1. If the hardware or VM doesn't set
>> the right cpuid bits, num_rmid can be 0.
>>
>> Signed-off-by: Barret Rhoden <brho@xxxxxxxxxx>
>>
>> ---
>> I ran into this on a VM on granite rapids. I guess the VMM told the
>> kernel it was a GNR, but didn't set all the cache/resctrl bits.
>
> The -1 default of x86_cache_max_rmid is assigned if the hardware does not
> support *any* L3 monitoring. Specifically:
>
> resctrl_cpu_detect():
> 	if (!cpu_has(c, X86_FEATURE_CQM_LLC)) {
> 		c->x86_cache_max_rmid = -1;
> 		...
> 	}
>
> The function modified by this patch, rdt_get_mon_l3_config(), only runs if
> the hardware supports one or more of the L3 monitoring sub-features
> (X86_FEATURE_CQM_OCCUP_LLC, X86_FEATURE_CQM_MBM_TOTAL, or
> X86_FEATURE_CQM_MBM_LOCAL) that depend on X86_FEATURE_CQM_LLC per cpuid_deps[].
>
> I tried to reproduce the issue on real hardware by using clearcpuid to
> disable X86_FEATURE_CQM_LLC, and the CPUID dependencies did the right thing
> by automatically disabling X86_FEATURE_CQM_OCCUP_LLC, X86_FEATURE_CQM_MBM_TOTAL,
> and X86_FEATURE_CQM_MBM_LOCAL, not running rdt_get_mon_l3_config() at all, and
> not even attempting to enumerate any of the L3 monitoring details.
>
> What are the symptoms when you encounter this issue?

Linux crashes during boot with a divide error, and the splat backtrace is in rdt_get_mon_l3_config().
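
The failure mode described above can be sketched as a standalone C model. This is only an illustration under stated assumptions: rmid_threshold() and its parameters are hypothetical names, it assumes num_rmid is derived as max_rmid + 1 and then used as a divisor, and it is neither the kernel code nor the patch itself.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace model, not kernel code. Assumption: num_rmid is
 * computed as max_rmid + 1, so the -1 default produces num_rmid == 0.
 */
int rmid_threshold(int max_rmid, uint32_t realloc_limit, uint32_t *threshold)
{
	uint32_t num_rmid = (uint32_t)(max_rmid + 1);

	/* Refuse to divide when no RMIDs were enumerated. */
	if (num_rmid == 0)
		return -EINVAL;

	*threshold = realloc_limit / num_rmid;
	return 0;
}
```

Without the num_rmid check, a max_rmid of -1 would reach the division with a zero divisor, which on x86 raises a divide error.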

> Would it be possible to send me the CPUID flags of leaf 7, subleaf 0 as
> well as all sub-leaves of leaf 0xF?

# ./cpuid 0x7 0
CPUID for Leaf 0x00000007, Sublevel 0x00000000:
eax: 00000002
ebx: f1bf2ffb
ecx: 1b415f7e
edx: bc814410

# ./cpuid 0x7 1
CPUID for Leaf 0x00000007, Sublevel 0x00000001:
eax: 00201c30
ebx: 00000000
ecx: 00000000
edx: 00084000

# ./cpuid 0x7 2
CPUID for Leaf 0x00000007, Sublevel 0x00000002:
eax: 00000000
ebx: 00000000
ecx: 00000000
edx: 0000003f
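
For reading the leaf 7, subleaf 0 dump, a small helper like the one below may be handy. It is a sketch, not something from the kernel: leaf7_ebx_has() and the macro names are made up for illustration. The bit positions are the ones documented for CPUID.(EAX=7,ECX=0):EBX, where bit 12 is RDT monitoring (X86_FEATURE_CQM in Linux) and bit 15 is RDT allocation (X86_FEATURE_RDT_A).

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX bit positions; macro names are hypothetical. */
#define LEAF7_EBX_RDT_M_BIT 12 /* RDT monitoring, Linux X86_FEATURE_CQM */
#define LEAF7_EBX_RDT_A_BIT 15 /* RDT allocation, Linux X86_FEATURE_RDT_A */

/* Return true if the given bit is set in the raw EBX value. */
bool leaf7_ebx_has(uint32_t ebx, unsigned int bit)
{
	return (ebx >> bit) & 1u;
}
```

Called with ebx = 0xf1bf2ffb from the subleaf 0 output, both leaf7_ebx_has(ebx, LEAF7_EBX_RDT_M_BIT) and leaf7_ebx_has(ebx, LEAF7_EBX_RDT_A_BIT) return false, assuming the dump is read correctly.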

> Could you please also elaborate what the impact of this issue is? Is this
> a VM that has been released with many users impacted or something encountered
> during development of this VM?

This is with cloud-hypervisor. We do have a couple of local patches for running on machines with more than 256 CPUs. I didn't see anything in our changes related to cpuid 0x7, but maybe it's on our end.

But I imagine the problem isn't widespread and could be considered developmental.

I'll keep poking on my end - maybe I had some other cruft in my system (in the kernel build or in cloud-hypervisor).

Thanks,
Barret