Re: [Xen-devel] [PATCH v2] xen: Fix x86 sched_clock() interface for xen

From: Hans van Kranenburg
Date: Fri Jan 11 2019 - 16:46:31 EST


On 1/11/19 4:57 PM, Hans van Kranenburg wrote:
> On 1/11/19 3:01 PM, Juergen Gross wrote:
>> On 11/01/2019 14:12, Hans van Kranenburg wrote:
>>> Hi,
>>>
>>> On 1/11/19 1:08 PM, Juergen Gross wrote:
>>>> Commit f94c8d11699759 ("sched/clock, x86/tsc: Rework the x86 'unstable'
>>>> sched_clock() interface") broke Xen guest time handling across
>>>> migration:
>>>>
>>>> [ 187.249951] Freezing user space processes ... (elapsed 0.001 seconds) done.
>>>> [ 187.251137] OOM killer disabled.
>>>> [ 187.251137] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
>>>> [ 187.252299] suspending xenstore...
>>>> [ 187.266987] xen:grant_table: Grant tables using version 1 layout
>>>> [18446743811.706476] OOM killer enabled.
>>>> [18446743811.706478] Restarting tasks ... done.
>>>> [18446743811.720505] Setting capacity to 16777216
>>>>
>>>> Fix that by setting xen_sched_clock_offset at resume time to ensure a
>>>> monotonic clock value.
>>>>
>>>> [...]
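
(Editorial aside for readers of the archive: below is a minimal sketch of the
"re-anchor the offset at resume" idea described in the commit message above.
The names here are made up for illustration -- raw_clock() stands in for
whatever Xen clocksource read the real code uses -- and this is not the actual
patch.)

    /*
     * Minimal sketch, not the actual patch. raw_clock() is a made-up
     * stand-in for the Xen clocksource read (system time in ns).
     */
    typedef unsigned long long u64;

    static u64 raw_clock(void);         /* stub for the Xen clocksource read */

    static u64 sched_clock_offset;
    static u64 saved_sched_clock;

    static u64 xen_sched_clock(void)
    {
            /* Stays monotonic as long as the offset is re-anchored at resume. */
            return raw_clock() - sched_clock_offset;
    }

    static void xen_sched_clock_suspend(void)
    {
            saved_sched_clock = xen_sched_clock();
    }

    static void xen_sched_clock_resume(void)
    {
            /*
             * The new host's clock may be far behind (or ahead of) the old
             * one; recompute the offset so sched_clock() continues from the
             * value it had before the migration instead of jumping.
             */
            sched_clock_offset = raw_clock() - saved_sched_clock;
    }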
>>>
>>> I'm throwing around a PV domU over a bunch of test servers with live
>>> migrate now, and in between, I'm seeing this in the kernel log:
>>>
>>> [Fri Jan 11 13:58:42 2019] Freezing user space processes ... (elapsed
>>> 0.002 seconds) done.
>>> [Fri Jan 11 13:58:42 2019] OOM killer disabled.
>>> [Fri Jan 11 13:58:42 2019] Freezing remaining freezable tasks ...
>>> (elapsed 0.000 seconds) done.
>>> [Fri Jan 11 13:58:42 2019] suspending xenstore...
>>> [Fri Jan 11 13:58:42 2019] ------------[ cut here ]------------
>>> [Fri Jan 11 13:58:42 2019] Current state: 1
>>> [Fri Jan 11 13:58:42 2019] WARNING: CPU: 3 PID: 0 at
>>> kernel/time/clockevents.c:133 clockevents_switch_state+0x48/0xe0
>>> [...]
>>>
>>> This always happens on every *first* live migrate that I do after
>>> starting the domU.
>>
>> Yeah, it's a WARN_ONCE().

Ok, false alarm. It's there, but not caused by this patch.

I changed the WARN_ONCE to a plain WARN for fun, and now I see it a lot
more often already (this is with v2):

https://paste.debian.net/plainh/d535a379
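
(For reference: WARN_ONCE() only prints its warning the first time the
condition triggers after boot, which is why turning it into a plain WARN
makes every occurrence visible. Simplified, the macro does roughly this:)

    /* Roughly what WARN_ONCE() boils down to (simplified). */
    #define WARN_ONCE(condition, format...) ({      \
            static bool __warned;                    \
            int __ret = !!(condition);               \
                                                     \
            if (__ret && !__warned) {                \
                    __warned = true;                 \
                    WARN(1, format);                 \
            }                                        \
            __ret;                                   \
    })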

>> And you didn't see it with v1 of the patch?
>
> No.

I was wrong. I tried a bit more, and I can also reproduce it without the
v1 or v2 patch at all, and with plain v4.19.9. It just sometimes takes a
dozen or so more live migrations before it shows up.

I cannot reproduce it with the Debian 4.19.9 distro kernel, which is why
it was new to me.

So, let's ignore it in this thread now.

>> On the first glance this might be another bug just being exposed by
>> my patch.
>>
>> I'm investigating further, but this might take some time. Could you
>> meanwhile verify the same happens with kernel 5.0-rc1? That was the
>> one I tested with and I didn't spot that WARN.
>
> I have Linux 5.0-rc1 with v2 on top now, which gives me this on live
> migrate:
>
> [...]
> [ 51.871076] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000098
> [ 51.871091] #PF error: [normal kernel read fault]
> [ 51.871100] PGD 0 P4D 0
> [ 51.871109] Oops: 0000 [#1] SMP NOPTI
> [ 51.871117] CPU: 0 PID: 36 Comm: xenwatch Not tainted 5.0.0-rc1 #1
> [ 51.871132] RIP: e030:blk_mq_map_swqueue+0x103/0x270
> [...]

I don't know what to make of all the 5.0-rc1 crashes yet. I can provide
more feedback on those if you want, but presumably not in this thread.

Hans