Re: Intermittent Qemu boot hang/regression traced back to INT 0x80 changes
From: Paul Gortmaker
Date: Sun May 12 2024 - 16:23:25 EST
[Re: Intermittent Qemu boot hang/regression traced back to INT 0x80 changes] On 26/04/2024 (Fri 08:24) Paul Gortmaker wrote:
> [Re: Intermittent Qemu boot hang/regression traced back to INT 0x80 changes] On 24/04/2024 (Wed 21:51) Borislav Petkov wrote:
>
> > On Wed, Apr 24, 2024 at 02:58:06PM -0400, Paul Gortmaker wrote:
> > ...
> > > pci 0000:00:1d.0: [8086:2934] type 00 class 0x0c0300 conventional PCI endpoint
> > > pci 0000:00:1d.0: BAR 4 [io 0xc080-0xc09f]
> > > pci 0000:00:1d.1: [8086:2935] type 00 class 0x0c0300 conventional PCI endpoint
> > > pci 0000:00:1d.1: BAR 4 [io 0xc0a0-0xc0bf]
> > > pci 0000:00:1d.2: [8086:2936] type 00 class 0x0c0300 conventional PCI endpoint
> > > <hang - not always exactly here, but always in this block of PCI printk>
> >
[...]
> So I owe you guys an apology for pointing the finger at INT80. I still
> don't understand how the pseudo bisect on v6.6-stable seems so
> "concrete". The v6.6.6 worked "fine" (it seemed) and v6.6.7 died fairly
> quickly. The revert of INT80 on v6.6.7 seemed to "fix" it - but if so,
> it was only because it perturbed something else.
With hindsight, it is pretty clear the kernel image changes/alignment
were doing exactly that - triggering a dormant issue in QEMU.
> I want to try some of these things, but I also don't want to
> accidentally lose the reproducer I have. Maybe I'll see if I can
> reproduce it at home, since I'll lose use of the current box in a week
> anyoway...
So I did reproduce it at home, and once I got off the shared server and
onto my own stuff, I could prove Boris was right in suspecting QEMU.
> Again, sorry for the false positive. I let the v6.6-stable testing bias
> my mainline conclusions to where I didn't test underneath INT80. I'll
> follow up with more details once (if?) I manage to properly sort this.
Turns out, with my own stuff, and dmesg not being locked down (annoying)
I found that there was a 1:1 correlation between a PCI hang and this:
qemu-system-i38[758683]: segfault at 7f7378b02 ip 0000557a5051cec4 sp 00007f7383dfe0e0 error 4 in qemu-system-i386[557a5019e000+5b0000]
Code: 84 00 00 00 00 00 41 55 49 89 cd 41 54 49 89 d4 55 48 89 fd 53 44 89 c3 48 83 ec 08 48 8b 07 48 85 c0 74 22 48 3b 47 38 74 1c <48> 83 78 08 00 48 8b 10 75 1e 48 8b 48 28 48 39 ce 0f 83 a5
..appearing in the dmesg output. Pretty hard to argue against letting
non-KVM QEMU own 100% of the blame for this one.
Paul.
--