RE: [PATCH] Drivers: hv: vmbus: handle various crash scenarios

From: KY Srinivasan
Date: Tue Mar 22 2016 - 10:18:20 EST




> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
> Sent: Tuesday, March 22, 2016 7:01 AM
> To: KY Srinivasan <kys@xxxxxxxxxxxxx>
> Cc: devel@xxxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Haiyang
> Zhang <haiyangz@xxxxxxxxxxxxx>; Alex Ng (LIS) <alexng@xxxxxxxxxxxxx>;
> Radim Krcmar <rkrcmar@xxxxxxxxxx>; Cathy Avery <cavery@xxxxxxxxxx>
> Subject: Re: [PATCH] Drivers: hv: vmbus: handle various crash scenarios
>
> KY Srinivasan <kys@xxxxxxxxxxxxx> writes:
>
> >> -----Original Message-----
> >> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
> >> Sent: Monday, March 21, 2016 12:52 AM
> >> To: KY Srinivasan <kys@xxxxxxxxxxxxx>
> >> Cc: devel@xxxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Haiyang
> >> Zhang <haiyangz@xxxxxxxxxxxxx>; Alex Ng (LIS) <alexng@xxxxxxxxxxxxx>;
> >> Radim Krcmar <rkrcmar@xxxxxxxxxx>; Cathy Avery <cavery@xxxxxxxxxx>
> >> Subject: Re: [PATCH] Drivers: hv: vmbus: handle various crash scenarios
> >>
> >> KY Srinivasan <kys@xxxxxxxxxxxxx> writes:
> >>
> >> >> -----Original Message-----
> >> >> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
> >> >> Sent: Friday, March 18, 2016 5:33 AM
> >> >> To: devel@xxxxxxxxxxxxxxxxxxxxxx
> >> >> Cc: linux-kernel@xxxxxxxxxxxxxxx; KY Srinivasan <kys@xxxxxxxxxxxxx>;
> >> >> Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; Alex Ng (LIS)
> >> >> <alexng@xxxxxxxxxxxxx>; Radim Krcmar <rkrcmar@xxxxxxxxxx>;
> >> >> Cathy Avery <cavery@xxxxxxxxxx>
> >> >> Subject: [PATCH] Drivers: hv: vmbus: handle various crash scenarios
> >> >>
> >> >> Kdump keeps biting. Turns out CHANNELMSG_UNLOAD_RESPONSE is always
> >> >> delivered to CPU0 regardless of what CPU we're sending
> >> >> CHANNELMSG_UNLOAD from. vmbus_wait_for_unload() doesn't account for
> >> >> this: if we're crashing on some other CPU while CPU0 is still alive
> >> >> and operational, CHANNELMSG_UNLOAD_RESPONSE will be delivered there,
> >> >> completing vmbus_connection.unload_event, and our wait on the
> >> >> current CPU will never end.
> >> >
> >> > What was the host you were testing on?
> >> >
> >>
> >> I was testing on both 2012R2 and 2016TP4. The bug is easily reproducible
> >> by forcing a crash on a secondary CPU, e.g.:
> >
> > Prior to 2012 R2, all messages were delivered on CPU0, including
> > CHANNELMSG_UNLOAD_RESPONSE. For this reason we don't support kexec on
> > pre-2012 R2 hosts. From 2012 R2 on, all vmbus messages (responses) are
> > delivered on the CPU that we initially set up - look at the code in
> > vmbus_negotiate_version(). So on 2012 R2 and later hosts, the response
> > to CHANNELMSG_UNLOAD (CHANNELMSG_UNLOAD_RESPONSE) will be delivered on
> > the CPU where we initiate contact with the host - the
> > CHANNELMSG_INITIATE_CONTACT message.
>
> Unfortunately there is a discrepancy between WS2012R2 and WS2016TP4. On
> WS2012R2 what you're saying is true and all messages, including
> CHANNELMSG_UNLOAD_RESPONSE, are delivered to the CPU we used for initial
> contact. On WS2016TP4 CHANNELMSG_UNLOAD_RESPONSE seems to be a special
> case: it is always delivered to CPU0, no matter which CPU we used for
> initial contact. This may be a host bug. You can use the attached patch
> to see the issue.

This looks like a host bug and I will try to get it addressed before WS2016
ships.
>
> For now I can suggest we check message pages for all CPUs from
> vmbus_wait_for_unload(). We can race with other CPUs again but we don't
> care as we're checking for completion_done() in the loop as well. I'll
> try this approach.
Thank you.

K. Y

>
> --
> Vitaly
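
For reference, the approach proposed above (checking the message pages of
all CPUs from vmbus_wait_for_unload() while also checking for completion)
can be illustrated with a small standalone sketch. This is not the actual
kernel patch: every sketch_* name, the message type value, and the
max_polls bound are assumptions made so the example compiles and
terminates on its own; the real code would work on the per-CPU SynIC
message pages and vmbus_connection.unload_event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names and values below are hypothetical, for illustration only. */
#define SKETCH_MAX_CPUS            8
#define SKETCH_MSG_NONE            0
#define SKETCH_MSG_UNLOAD_RESPONSE 1

/* Stand-in for one CPU's message slot. */
struct sketch_msg_slot {
	int type;
};

struct sketch_state {
	struct sketch_msg_slot slot[SKETCH_MAX_CPUS];
	bool unload_done;	/* stand-in for completion_done() */
};

/*
 * Poll every CPU's slot for the unload response instead of only the
 * current CPU's. Returns true once the response has been seen (or the
 * completion was already signalled by another CPU), false if max_polls
 * passes go by without it.
 */
static bool sketch_wait_for_unload(struct sketch_state *st, size_t ncpus,
				   unsigned int max_polls)
{
	assert(ncpus <= SKETCH_MAX_CPUS);

	for (unsigned int poll = 0; poll < max_polls; poll++) {
		/* Another CPU may have handled the response already. */
		if (st->unload_done)
			return true;

		for (size_t cpu = 0; cpu < ncpus; cpu++) {
			if (st->slot[cpu].type != SKETCH_MSG_UNLOAD_RESPONSE)
				continue;

			/* Consume the message and record completion. */
			st->slot[cpu].type = SKETCH_MSG_NONE;
			st->unload_done = true;
			return true;
		}
	}

	return false;
}
```

Checking the completion flag first in each pass is what makes the race
with other CPUs harmless: whichever side sees the response first, the
waiter still exits.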