Re: FreeNAS VM disk access errors, bisected to commit 6f1a4891a592

From: Marc Dionne
Date: Fri Apr 17 2020 - 20:49:55 EST


On Fri, Apr 17, 2020 at 5:19 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Marc,
>
> Marc Dionne <marc.c.dionne@xxxxxxxxx> writes:
>
> > Commit 6f1a4891a592 ("x86/apic/msi: Plug non-maskable MSI affinity
> > race") causes Linux VMs hosted on FreeNAS (bhyve hypervisor) to lose
> > access to their disk devices shortly after boot. The disks are ZFS
> > zvols on the host, presented to each VM.
> >
> > Background: I recently updated some Fedora 31 VMs running under the
> > bhyve hypervisor (hosted on a FreeNAS mini), and they moved to a
> > distro 5.5 kernel (5.5.15). Shortly after reboot, the disks became
> > inaccessible with any operation getting EIO errors. Booting back into
> > a 5.4 kernel, everything was fine. I built a 5.7-rc1 kernel, which
> > showed the same symptoms, and was then able to bisect it down to
> > commit 6f1a4891a592. Note that the symptoms do not occur on every
> > boot, but often enough (roughly 80%) to make bisection possible.
> >
> > Applying a manual revert of 6f1a4891a592 on top of mainline from
> > yesterday gives me a kernel that works fine.
>
> We verified on real hardware and various hypervisors that the fix
> actually works correctly.
>
> That makes me assume that the staged approach of changing affinity for
> this non-maskable MSI mess makes your particular hypervisor unhappy.
>
> Are there any messages like this:
>
> "do_IRQ: 0.83 No irq handler for vector"

I haven't seen those, although I only have a VNC console that scrolls
by rather fast. I did see a report from someone running Ubuntu 18.04
that showed this after the initial errors:

do_IRQ: 2.35 No irq handler for vector
ata1.00: revalidation failed (error=-5)
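
Since the console scrolls too fast to read, one way to check for these
messages is to search a saved copy of the kernel log instead. The
following is only a minimal sketch, not something from the report: the
function name find_spurious_vectors and the sample text are made up for
illustration, and the pattern assumes the message format quoted above
("do_IRQ: <cpu>.<vector> No irq handler for vector"). Other kernel
versions may print a different function-name prefix.

```python
import re

# Matches lines such as "do_IRQ: 2.35 No irq handler for vector".
# The first number is the CPU, the second is the vector.
# Assumption: the "do_IRQ:" prefix, as seen in the messages quoted above.
PATTERN = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")


def find_spurious_vectors(log_text):
    """Return a list of (line_number, cpu, vector) for each matching line."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = PATTERN.search(line)
        if match:
            hits.append((lineno, int(match.group(1)), int(match.group(2))))
    return hits


# Hypothetical sample built from the two log lines quoted above.
sample = (
    "do_IRQ: 2.35 No irq handler for vector\n"
    "ata1.00: revalidation failed (error=-5)\n"
)

for lineno, cpu, vector in find_spurious_vectors(sample):
    print(f"line {lineno}: cpu {cpu}, vector {vector}")
```

To use it on a real guest, save the log first (for example with
"dmesg > dmesg.txt") and pass the file contents to the function. The
line numbers make it easy to see whether the message shows up before or
after the first disk error.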

> in dmesg on the Linux side? If they happen, it would be before the
> disk timeout.
>
> I have absolutely zero knowledge about bhyve, so may I suggest to talk
> to the bhyve experts about this.

I opened a ticket with iXsystems. I noticed several people reporting
the same problem in their community forums.

Marc