Re: [PATCH v2 1/1] Drivers: hv: vmbus: Increase wait time for VMbus unload

From: Wei Liu
Date: Tue Apr 20 2021 - 15:46:52 EST


On Tue, Apr 20, 2021 at 11:31:54AM +0200, Vitaly Kuznetsov wrote:
> Michael Kelley <mikelley@xxxxxxxxxxxxx> writes:
>
> > When running in Azure, disks may be connected to a Linux VM with
> > read/write caching enabled. If a VM panics and issues a VMbus
> > UNLOAD request to Hyper-V, the response is delayed until all dirty
> > data in the disk cache is flushed. In extreme cases, this flushing
> > can take tens of seconds, depending on the disk speed and the amount
> > of dirty data. If kdump is configured for the VM, the current 10 second
> > timeout in vmbus_wait_for_unload() may be exceeded, and the UNLOAD
> > complete message may arrive well after the kdump kernel is already
> > running, causing problems. Note that no problem occurs if kdump is
> > not enabled because Hyper-V waits for the cache flush before doing
> > a reboot through the BIOS/UEFI code.
> >
> > Fix this problem by increasing the timeout in vmbus_wait_for_unload()
> > to 100 seconds. Also output periodic messages so that if anyone is
> > watching the serial console, they won't think the VM is completely
> > hung.
> >
> > Fixes: 911e1987efc8 ("Drivers: hv: vmbus: Add timeout to vmbus_wait_for_unload")
> > Signed-off-by: Michael Kelley <mikelley@xxxxxxxxxxxxx>

Applied to hyperv-next. Thanks.

> > ---
[...]
> >
> > +#define UNLOAD_DELAY_UNIT_MS 10 /* 10 milliseconds */
> > +#define UNLOAD_WAIT_MS (100*1000) /* 100 seconds */
> > +#define UNLOAD_WAIT_LOOPS (UNLOAD_WAIT_MS/UNLOAD_DELAY_UNIT_MS)
> > +#define UNLOAD_MSG_MS (5*1000) /* Every 5 seconds */
> > +#define UNLOAD_MSG_LOOPS (UNLOAD_MSG_MS/UNLOAD_DELAY_UNIT_MS)
> > +
> > static void vmbus_wait_for_unload(void)
> > {
> > int cpu;
> > @@ -772,12 +778,17 @@ static void vmbus_wait_for_unload(void)
> > * vmbus_connection.unload_event. If not, the last thing we can do is
> > * read message pages for all CPUs directly.
> > *
> > - * Wait no more than 10 seconds so that the panic path can't get
> > - * hung forever in case the response message isn't seen.
> > + * Wait up to 100 seconds since an Azure host must writeback any dirty
> > + * data in its disk cache before the VMbus UNLOAD request will
> > + * complete. This flushing has been empirically observed to take up
> > + * to 50 seconds in cases with a lot of dirty data, so allow additional
> > + * leeway and for inaccuracies in mdelay(). But eventually time out so
> > + * that the panic path can't get hung forever in case the response
> > + * message isn't seen.
>
> I vaguely remember debugging cases where CHANNELMSG_UNLOAD_RESPONSE never
> arrived; it was kind of pointless to proceed to kexec, as attempts to
> reconnect VMbus devices were failing (no devices were offered after
> CHANNELMSG_REQUESTOFFERS, AFAIR). Would it maybe make sense to just do
> an emergency reboot instead of proceeding to kexec when this happens? Just
> wondering.
>

Please submit a follow-up patch if necessary.

Wei.
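
For anyone reading the thread later, the arithmetic behind the quoted
macros can be checked with a small userspace model. This is only a
sketch, not the kernel code: the loop body is elided from the quoted
diff, so the loop shape, the one-message-per-UNLOAD_MSG_LOOPS cadence,
the struct, and the function name simulate_unload_wait() are all
assumptions made for illustration. No real delay is performed; each
iteration simply stands in for one mdelay(UNLOAD_DELAY_UNIT_MS).

```c
#include <assert.h>
#include <stdbool.h>

/* Constants copied from the quoted patch. */
#define UNLOAD_DELAY_UNIT_MS	10		/* 10 milliseconds */
#define UNLOAD_WAIT_MS		(100*1000)	/* 100 seconds */
#define UNLOAD_WAIT_LOOPS	(UNLOAD_WAIT_MS/UNLOAD_DELAY_UNIT_MS)
#define UNLOAD_MSG_MS		(5*1000)	/* Every 5 seconds */
#define UNLOAD_MSG_LOOPS	(UNLOAD_MSG_MS/UNLOAD_DELAY_UNIT_MS)

/* Hypothetical result type, for illustration only. */
struct unload_sim_result {
	bool completed;		/* response seen before the timeout */
	int loops;		/* delay units consumed */
	int messages;		/* progress messages that would be printed */
};

/*
 * Hypothetical userspace model (not the kernel function): the UNLOAD
 * response "arrives" once response_after_ms of simulated time has passed.
 */
static struct unload_sim_result simulate_unload_wait(long response_after_ms)
{
	struct unload_sim_result r = { false, 0, 0 };
	int i;

	for (i = 1; i <= UNLOAD_WAIT_LOOPS; i++) {
		/* Stand-in for checking whether the response was seen. */
		if ((long)(i - 1) * UNLOAD_DELAY_UNIT_MS >= response_after_ms) {
			r.completed = true;
			break;
		}
		/* Assumed cadence: one message every UNLOAD_MSG_LOOPS loops. */
		if (i % UNLOAD_MSG_LOOPS == 0)
			r.messages++;
		/* Stand-in for mdelay(UNLOAD_DELAY_UNIT_MS). */
		r.loops++;
	}
	return r;
}
```

Under these assumptions, UNLOAD_WAIT_LOOPS is 10000 and UNLOAD_MSG_LOOPS
is 500. A response arriving after the 50 seconds mentioned in the patch
comment would be preceded by 10 progress messages, and a response that
never arrives would end the wait after 20 messages.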