Re: [perf] unchecked MSR access error: WRMSR to 0x689 in intel_pmu_lbr_restore

From: Pawan Gupta
Date: Mon Jul 11 2022 - 18:17:05 EST


On Mon, Jul 11, 2022 at 11:25:34AM -0400, Liang, Kan wrote:


On 2022-07-08 12:13 p.m., Vince Weaver wrote:
On Wed, 6 Jul 2022, Vince Weaver wrote:

I left the fuzzer running for a long time on 5.19-rc1, and after a few
weeks it triggered this weird trace. It is repeatable (although I haven't
narrowed down exactly what's causing it).

It's odd in that it just dumps a <TASK>; it doesn't provide any info on
what the actual trigger is.

This is on a Haswell machine.

I bumped up to current git and managed to trigger this again; this time
it actually printed the error message.

[ 7763.384369] unchecked MSR access error: WRMSR to 0x689 (tried to write 0x1fffffff8101349e) at rIP: 0xffffffff810704a4 (native_write_msr+0x4/0x20)

0x689 is a valid LBR register, MSR_LASTBRANCH_9_FROM_IP.
The issue is probably caused by the known TSX bug described in
commit 9fc9ddd61e0 ("perf/x86/intel: Fix MSR_LAST_BRANCH_FROM_x bug
when no TSX"). It looks like TSX support has been deactivated, but
the quirk from that commit isn't being applied for some reason.


To apply the quirk, perf relies on the boot CPU's feature flags and the
LBR format:

static inline bool lbr_from_signext_quirk_needed(void)
{
	bool tsx_support = boot_cpu_has(X86_FEATURE_HLE) ||
			   boot_cpu_has(X86_FEATURE_RTM);

	return !tsx_support && x86_pmu.lbr_has_tsx;
}
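
For illustration, here is a minimal user-space sketch of how a predicate
like the one above would gate a sign-extension fix on the value before
the WRMSR. The names quirk_needed() and from_signext_wr(), and the plain
booleans standing in for boot_cpu_has() and x86_pmu.lbr_has_tsx, are
assumptions made for this example and not the kernel's code. The idea,
as I understand the commit, is that with TSX disabled bits 61 and 62 of
a FROM value must be sign-extension bits, so bits 59 and 60 are copied
up into them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 59 and 60 always lie in the sign-extension part of a 48-bit address. */
#define SIGNEXT_2MSB ((UINT64_C(1) << 60) | (UINT64_C(1) << 59))

/* Hypothetical stand-in for lbr_from_signext_quirk_needed(). */
static bool quirk_needed(bool has_hle, bool has_rtm, bool lbr_has_tsx)
{
	bool tsx_support = has_hle || has_rtm;

	return !tsx_support && lbr_has_tsx;
}

/* Copy bits 59:60 into bits 61:62 when the quirk applies; bit 63 is untouched. */
static uint64_t from_signext_wr(uint64_t val, bool quirk)
{
	if (quirk)
		val |= (val & SIGNEXT_2MSB) << 2;
	return val;
}
```

With the value from the log, 0x1fffffff8101349e, the quirk path would
produce 0x7fffffff8101349e. Without the quirk the value is passed
through unchanged, which is the value the log shows being rejected.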

Could you please share the value of the PERF_CAPABILITIES MSR (0x00000345)
of the machine?
I'd like to double-check whether the LBR format is correct; 0x5 is expected.
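
To pick the LBR format out of the raw MSR value: per the Intel SDM, the
format is held in the low six bits (5:0) of IA32_PERF_CAPABILITIES. A
small sketch, where lbr_format() is a helper name made up for this
example:

```c
#include <stdint.h>

/* LBR format lives in bits 5:0 of IA32_PERF_CAPABILITIES (MSR 0x345). */
#define PERF_CAP_LBR_FMT_MASK UINT64_C(0x3f)

/* Hypothetical helper: extract the LBR format from a raw rdmsr value. */
static unsigned int lbr_format(uint64_t perf_capabilities)
{
	return (unsigned int)(perf_capabilities & PERF_CAP_LBR_FMT_MASK);
}
```

For example, a raw value of 0x32c5 (a number invented for illustration,
not taken from this machine) would give an LBR format of 0x5.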


If the LBR format is correct, maybe the boot CPU's flag is not cleared
when the TSX support is deactivated.
I noticed that Pawan recently had several TSX patches merged which may
impact the flags.
  400331f8ffa3 ("x86/tsx: Disable TSX development mode at boot")
  258f3b8c3210 ("x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits")
If you only observe the issue with the latest kernel, you may want to
revert the above two patches and see if it helps.

The output of the following would be helpful:

# grep "rtm\|hle" /proc/cpuinfo

ARCH_CAP
# rdmsr 0x10a

TSX_CTRL
# rdmsr 0x122

MCU_OPT_CTRL
# rdmsr 0x123

TSX_FORCE_ABORT
# rdmsr 0x10f

Please note, some of these MSRs may not exist on your platform.
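
If msr-tools is not installed, the same values can be read through the
msr driver's character device. A minimal sketch, assuming the msr module
is loaded and the program runs as root; read_msr() is a name made up
here. A read of an MSR the CPU does not implement fails, which is how
the missing ones would show up.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one MSR on the given CPU via /dev/cpu/N/msr.
 * Returns 0 on success, -1 if the device cannot be opened or the read
 * fails (for example, because the MSR does not exist on this CPU).
 */
static int read_msr(int cpu, uint32_t reg, uint64_t *val)
{
	char path[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The msr driver uses the file offset as the MSR number. */
	if (lseek(fd, (off_t)reg, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}
	n = read(fd, val, sizeof(*val));
	close(fd);

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

For instance, read_msr(0, 0x122, &val) corresponds to "rdmsr 0x122" on
CPU 0.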

Thanks,
Pawan