Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen

From: Boris Ostrovsky
Date: Tue Oct 31 2017 - 07:06:40 EST




On 10/30/2017 11:13 PM, Dongli Zhang wrote:
Hi Boris,

On 10/31/2017 08:58 AM, Boris Ostrovsky wrote:


On 10/30/2017 08:14 PM, Dongli Zhang wrote:
Hi Boris,

On 10/30/2017 09:34 PM, Boris Ostrovsky wrote:
On 10/30/2017 04:03 AM, Dongli Zhang wrote:
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_lock() might be less than this_rq()->prev_steal_time which is
derived from previous return value of xen_steal_clock().

For instance, steal time of each vcpu is 335 before live migration.

cpu 198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu 200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0

Since runstate times are cumulative and cleared during xen live migration
by xen hypervisor, the idea of this patch is to accumulate runstate times
to global percpu variables before live migration suspend. Once guest VM is
resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
runstate times and previously accumulated times stored in global percpu
variables.

Similar and more severe issue would impact prior linux 4.8-4.10 as
discussed by Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,

which would overflow steal time and lead to 100% st usage in top command
for linux 4.8-4.10. A backport of this patch would fix that issue.

References:
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest

Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>

---
Changed since v1:
* relocate modification to xen_get_runstate_snapshot_cpu

Changed since v2:
* accumulate runstate times before live migration

Changed since v3:
* do not accumulate times in the case of guest checkpointing

Changed since v4:
* allocate array of vcpu_runstate_info to reduce number of memory allocation

---
drivers/xen/manage.c | 2 ++
drivers/xen/time.c | 68
++++++++++++++++++++++++++++++++++++++++++--
include/xen/interface/vcpu.h | 2 ++
include/xen/xen-ops.h | 1 +
4 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c425d03..3dc085d 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -72,6 +72,7 @@ static int xen_suspend(void *data)
}
gnttab_suspend();
+ xen_accumulate_runstate_time(-1);
xen_arch_pre_suspend();
/*
@@ -84,6 +85,7 @@ static int xen_suspend(void *data)
: 0);
xen_arch_post_suspend(si->cancelled);
+ xen_accumulate_runstate_time(si->cancelled);

I am not convinced that the comment above HYPERVISOR_suspend() is
correct. The call can return an error code and so if it returns -EPERM
(which AFAICS it can't now but might in the future) then
xen_accumulate_runstate_time() will do wrong thing.

I would split xen_accumulate_runstate_time() into two functions to avoid the
-EPERM issue, as one is for saving and another is for accumulation, respectively.

Otherwise, can you use xen_accumulate_runstate_time(2) for saving before suspend
and xen_accumulate_runstate_time(si->cancelled) after resume?


I'd probably just say something like

si->cancelled = HYPERVISOR_suspend() ? 1 : 0;

As the call of HYPERVISOR_suspend() takes 3 lines, I would make it as below and
I think gcc would optimize it.

- /*
- * This hypercall returns 1 if suspend was cancelled
- * or the domain was merely checkpointed, and 0 if it
- * is resuming in a new domain.
- */
si->cancelled = HYPERVISOR_suspend(xen_pv_domain()
? virt_to_gfn(xen_start_info)
: 0);
+ si->cancelled = si->cancelled ? 1 : 0;


Or xen_manage_runstate_time(si->cancelled ? 1 : 0) --- that way you preserve si->cancelled if anyone cares later (which at the moment nooone does) Either way I think is fine.




and keep xen_accumulate_runstate_time() as is (maybe rename it to
xen_manage_runstate_time()). And also remove the comment above the hypercall as
it is incorrect (but please mention the reason in the commit message)




gnttab_resume();
if (!si->cancelled) {
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index ac5f23f..cf3afb9 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -19,6 +19,9 @@
/* runstate info updated by Xen */
static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
+static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
+static struct vcpu_runstate_info *runstate_delta;

I'd move this inside xen_accumulate_runstate_time() since that's the

If we split xen_accumulate_runstate_time() into two functions, we would leave
runstate_delta as global static.

only function that uses it. And why does it need to be
vcpu_runstate_info and not u64[4]?

This was suggested by Juergen to avoid the allocation and reclaim of the second
dimensional array as in v4 of this patch?

Or would you like to allocate sizeof(u64[4]) * num_possible_cpus() and emulate
the 2d array with this 1d array and move the pointer forward sizeof(u64[4]) in
each iteration?


I was thinking of

u64 **runstate_delta = (u64 **)kmalloc(sizeof(xen_runstate.time) *
num_possible_cpus())

and then you should be able to access runstate_delta[cpu][RUNSTATE_*].

Would the above code work?

Ugh... If course it will not, it's clearly wrong. So nevermind, sorry. Keep it the way it was originally.

-boris