Re: Kernel 5.3.x, 5.2.2+: VMware player suspend on 64/32 bit guests

From: Woody Suwalski
Date: Mon Aug 12 2019 - 10:42:59 EST


Thomas, Rafael,
I have added a timeout counter in __synchronize_hardirq().
At the bottom I have converted while(inprogress); to while(inprogress
&& timeout++ < 100);

That bypasses the suspend lockup problem. On both 32-bit and
64-bit VMs the timeout is triggered by the sync of irq9,
which probably means that there is some issue in the ACPI handler and
synchronize_hardirq() is stuck on it?
I will try to repeat with 5.3-rc4 tomorrow.
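
A minimal user-space sketch of the bounded wait described above, not the
actual kernel change. The names fake_irq, fake_irq_inprogress() and
bounded_sync() are hypothetical, and the locking and hardware state query
done by the real __synchronize_hardirq() are omitted. It only illustrates
how a retry cap turns an unbounded spin into one that gives up after a
fixed number of polls:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the "interrupt in progress" state. */
struct fake_irq {
	int busy_polls;	/* how many more polls will report "in progress" */
};

/* Returns true while the simulated handler is still in progress. */
static bool fake_irq_inprogress(struct fake_irq *irq)
{
	if (irq->busy_polls > 0) {
		irq->busy_polls--;
		return true;
	}
	return false;
}

/*
 * Bounded wait: keep polling until the flag clears or the retry cap is
 * reached. Returns true if the interrupt went idle, false if we gave up.
 */
static bool bounded_sync(struct fake_irq *irq, int max_polls)
{
	int timeout = 0;
	bool inprogress;

	assert(max_polls > 0);
	do {
		inprogress = fake_irq_inprogress(irq);
	} while (inprogress && timeout++ < max_polls);

	return !inprogress;
}
```

A false return corresponds to the case seen here with irq9: the
interrupt never went idle, and the cap only bypasses the hang rather
than explaining it.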

Thanks, Woody

On Sat, Aug 10, 2019 at 7:24 AM Woody Suwalski <terraluna977@xxxxxxxxx> wrote:
>
> Moving the thread to LKML, as suggested by Thomas...
> >
> >> ---------- Forwarded message ---------
> >> From: Woody Suwalski <terraluna977@xxxxxxxxx>
> >> Date: Thu, Aug 1, 2019 at 3:45 PM
> >> Subject: Intermittent suspend on 5.3 / 5.2
> >> To: Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>
> >>
> >>
> >> Hi Rafał,
> >> I know that you are investigating some issues between these 2 kernels,
> >> however I see what is probably an unrelated problem with suspend on 5.3 and
> >> 5.2.4. I think it has crept into 5.1.21 as well, but I am not sure (it is
> >> intermittent). So far 4.20.17 works OK, and I think 5.2.0 works OK.
> >> The problem I see is on both 32 and 64 bit VMs, in VMware Workstation
> >> 15. The VM tries to suspend when there is no activity. Normally it leaves a
> >> black box with the cursor in the top-left position. Upon wakeup from VMware
> >> it goes to the VMware pre-BIOS screen, and then expands the black box to the
> >> run-size and switches to X.
> >> The problem with new kernels is that (I think) the suspend fails - the
> >> black box with cursor is there, but seems bigger, and of course is not
> >> wakeable (have to reset). In kern.log suspend seems to be running OK, and
> >> then new dmesg lines kick in, and there is no obvious culprit.
> >> So I am looking for some free advice:
> >> a. You already know what it is
> >> b. You may have suggestions as to which upstream patch could be to blame
> >> c. I should boot with some debug params (console_off=0, or some other?)
> >> and get some real info?
> >>
> >> BTW. For suspend to work I had to override mem_sleep to [shallow], or
> >> maybe later to [s2idle] (the actual VMs are at work, referring from
> >> memory...)
> >>
> >> If you have any ideas, all are welcome
> >> Thanks, Woody
>
>
>
> On 8/6/2019 3:18 PM, Woody Suwalski wrote:
> > Rafal, the patch (in 5.3-rc3)
> >
> > Fixes: f850a48a0799 ("ACPI: PM: Allow transitions to D0 to occur in
> > special cases")
> >
> > does not fix the issue - it must be something else...
>
> Sorry for the late response.
>
> There are known issues in 5.3-rc related to power management which
> should be fixed in -rc4. Please try that one when it is out.
>
> Cheers!
>
>
>
> Thomas Gleixner wrote:
> > Woody,
> >
> > On Fri, 9 Aug 2019, Woody Suwalski wrote:
> >
> > For future things like this, please CC LKML. There is nothing secret here
> > and CC'ing the mailing list allows other people to find this and spare
> > themselves the whole bisection pain. Aside from that, private mail does not
> > scale. On the list other people can look at it and give input eventually.
> >
> >> After bisecting I have found the potential culprit:
> >> dfe0cf8b x86/ioapic: Implement irq_get_irqchip_state() callback
> >>
> >> I am repeating the bisection from start to re-confirm.
> >>
> >> Reverse-patch on 5.3-rc3 (64bit) is fixing the problem for me.
> >> What is unclear - just adding the patch to 5.2.1 does not seem to
> >> break it. So there is some more magic involved.
> > Of course it does not do anything because 5.2.1 does not have
> >
> > f4999a2a3a48 ("genirq: Add optional hardware synchronization for shutdown")
> >
> >> Thomas, any suggestions?
> > What that means is that there is an interrupt shutdown which hits the
> > condition where an interrupt _IS_ marked in the IOAPIC as delivered to a
> > CPU, but not serviced yet.
> >
> > Now the question is why it is not serviced. suspend_device_irqs() is
> > calling into synchronize_irq(), which is probably the place where
> > it hangs. But that's called with CPUs online and interrupts enabled.
> >
> >> To reproduce: use VMware Player 15, either a 32 or 64 bit build.
> >> Reboot and run "systemctl suspend". The first suspend works OK. The
> >> second usually locks up on kernels 5.2.2 and up. Maybe try 4 times to
> >> confirm good (it is intermittent).
> > -ENOVMWAREPLAYER and I'm traveling so I don't have a machine handy to
> > install it. So if you can't debug it deeper down, I'm not going to have a
> > chance to look at it before the end of next week.
> >
> > That said, can we please move this to LKML?
> >
> > Thanks,
> >
> > tglx
> >
> >
> I can add some printk's into synchronize_irq(), however I have no idea if they
> will survive in the kmsg log after the next power-reset. I can wait for
> a week :-)
>
> Thanks, Woody
>