Re: [Patch v3 Part2 3/9] x86/microcode/intel: Fix collect_cpu_info() to reflect current microcode

From: Ashok Raj
Date: Tue Jan 31 2023 - 17:33:04 EST


On Tue, Jan 31, 2023 at 10:08:48PM +0100, Borislav Petkov wrote:
> On Tue, Jan 31, 2023 at 08:49:52PM +0000, Luck, Tony wrote:
> > What happens here if the update on the first hyperthread failed (sure, it shouldn't,
> > but stuff happens at large scale). In this case the current rev is still older that the
> > the cache version ... so there is no "goto out", and this hyperthread will now write
> > the MSR to initiate microcode update here, while the first thread is off executing
> > arbitrary code (the situation that we want to avoid).
>
> Lemme see if I can follow: we sync all threads in __reload_late() and
> once they all arrive, we send them down into ->apply_microcode.

I was interpreting Thomas's response here

https://lore.kernel.org/lkml/87y1pygiyf.ffs@tglx/

---------
#3

Not to talk about the completely broken error handling in the actual
microcode loading case in __reload_late()::wait_for_siblings code path.
---------

I thought sending sibling down the apply_microcode() just to update
revision might have bad interaction, so revise this to be more explicit and
update only the revision number and not have any other intended side
effect.

Patch3 was a prep-patch for patch4, to eliminate the call to
apply_microcode(). Maybe that wasn't the issue?

It's likely Thomas hinted at somthing else and I took the wrong hint.

>
> T0 arrives, and fails the update. That is this piece:
>
> /* write microcode via MSR 0x79 */
> wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
>
> rev = intel_get_microcode_revision();
>
> if (rev != mc->hdr.rev) {
> pr_err("CPU%d update to revision 0x%x failed\n",
> cpu, mc->hdr.rev);
> return UCODE_ERROR;
> }
>
> We return here without updating cpu_sig.rev, as we should.
>
> T1 arrives, updates successfully and updates its cpu_sig.rev.
>
> T0's patch level has been updated too with that because the microcode
> engine is shared between the threads. T0's cpu_sig.rev isn't, however,
> as that has happened "behind its back", so to speak.
>
> Is that the scenario you're talking about?

The fix in patch4 was just attempting to not call apply_microcode(), since
at this time the other CPUs that arrived in wait_for_siblings: have left,
doing an update when the others are strictly not in renedzvous is the
condition we try to avoid.

Again, maybe this is me reading Thomas's request incorrectly.

Cheers,
Ashok