RE: [PATCH] x86/hyperv: Pass on the lpj value from host to guest

From: Michael Kelley (LINUX)
Date: Thu Feb 16 2023 - 21:34:29 EST


From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, February 16, 2023 11:41 AM
>
> On Tue, Feb 14, 2023 at 04:19:13PM +0000, Michael Kelley (LINUX) wrote:
> > From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx>
> > >
> > > And have it preset.
> > > This change significantly reduces the time to bring up the guest
> > > SMP configuration, and also ensures the guest won't get inaccurate
> > > calibration results due to a "noisy neighbour" situation.
> > >
> > > Below are the numbers for a 16 VCPU guest before the patch (~1300 msec):
> > >
> > > [ 0.562938] x86: Booting SMP configuration:
> > > ...
> > > [ 1.859447] smp: Brought up 1 node, 16 CPUs
> > >
> > > and after the patch (~130 msec):
> > >
> > > [ 0.445079] x86: Booting SMP configuration:
> > > ...
> > > [ 0.575035] smp: Brought up 1 node, 16 CPUs
> > >
> > > This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> > > paravirt function to calculate cpu khz").
> >
> > This patch has been nagging at me a bit, and I finally did some further
> > checking. Looking at Linux guests on local Hyper-V and in Azure, I see
> > a dmesg output line like this during boot:
> >
> > Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.81 BogoMIPS (lpj=2593905)
> >
> > We're already skipping the delay loop calculation because lpj_fine
> > is set in tsc_init(), using the results of get_loops_per_jiffy(). The
> > latter does exactly the same calculation as hv_preset_lpj() in
> > this patch.
> >
> > Is this patch arising from an environment where tsc_init() is
> > skipped for some reason? Just trying to make sure we fully
> > understand when this patch is applicable, and when not.
> >
>
> The problem here is a bit different: "lpj_fine" is considered only for
> the boot CPU (from init/calibrate.c):
>
> } else if ((!printed) && lpj_fine) {
> 	lpj = lpj_fine;
> 	pr_info("Calibrating delay loop (skipped), "
> 		"value calculated using timer frequency.. ");
>
> while all the secondary ones use the timer to calibrate.
>
> With this change preset_lpj will be used for all cores (from
> init/calibrate.c):
>
> } else if (preset_lpj) {
> 	lpj = preset_lpj;
> 	if (!printed)
> 		pr_info("Calibrating delay loop (skipped) "
> 			"preset value.. ");
>
> This logic with lpj_fine comes from commit 3da757daf86e ("x86: use
> cpu_khz for loops_per_jiffy calculation"), where the commit message
> states the following:
>
> We do this only for the boot processor because the AP's can have
> different base frequencies or the BIOS might boot a AP at a different
> frequency.
>
> Hope this helps.
>

Indeed, you are right about lpj_fine being applied only to the boot
CPU. So I've looked a little closer, because I don't see the 1300
milliseconds that you see for a 16 vCPU guest.

I've been experimenting with a 32 vCPU guest, and without your
patch, it takes only 26 milliseconds to get all 32 vCPUs started. I
think the trick is in the call to calibrate_delay_is_known(). This
function copies the lpj value from a CPU in the same NUMA node
that has already been calibrated, assuming that constant_tsc is
set, which is the case in my test VM. So the boot CPU sets lpj
based on lpj_fine, and all other CPUs effectively copy the value
from the boot CPU without doing calibration.

I also experimented with multiple NUMA nodes. In that case, it
does take longer. Dividing the 32 vCPUs into 4 NUMA nodes,
it takes about 210 milliseconds to boot all 32 vCPUs. Presumably the
extra time is due to timer-based calibration being done once for each
NUMA node, plus probably some misc NUMA accounting overhead.
With preset_lpj set, that 210 milliseconds drops to 32 milliseconds,
which is more like the case with only 1 NUMA node, so there's some
modest benefit with multiple NUMA nodes.

Could you check if constant_tsc is set in your test environment? It
really should be set in a Hyper-V VM.

Michael