VMI Interface Proposal Documentation for I386, Part 3

From: Zachary Amsden
Date: Mon Mar 13 2006 - 14:51:34 EST


Time Interface.

In a virtualized environment, virtual machines (VMs) time-share the
system with each other and with other processes running on the
host system. Therefore, a VM's virtual CPUs (VCPUs) execute on the
host's physical CPUs (PCPUs) for only a portion of the time. This
section of the VMI exposes a paravirtual view of time to the guest
operating system so that it can operate more effectively in a
virtual environment. The interface also provides a way for VCPUs
to set alarms in this paravirtual view of time.

Time Domains:

a) Wallclock Time:

Wallclock time exposed to the VM through this interface indicates
the number of nanoseconds since the epoch, 1970-01-01T00:00:00Z (ISO
8601 date format). If the host's wallclock time changes (say, when
an error in the host's clock is corrected), so does the wallclock
time as viewed through this interface.
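A guest will usually want this nanosecond count in seconds-plus-nanoseconds form. The following is a minimal sketch, assuming the reading fits in an unsigned 64-bit integer (the width is not stated above); the function name is hypothetical.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: split a wallclock reading, in nanoseconds since
 * 1970-01-01T00:00:00Z, into a struct timespec. Assumes the reading is
 * delivered as a uint64_t. */
static struct timespec wallclock_ns_to_timespec(uint64_t ns_since_epoch)
{
    struct timespec ts;
    ts.tv_sec  = (time_t)(ns_since_epoch / 1000000000ULL); /* whole seconds */
    ts.tv_nsec = (long)(ns_since_epoch % 1000000000ULL);   /* leftover ns */
    return ts;
}
```

Because the host may correct its clock, successive readings are not guaranteed to be monotonic, so a guest should not use this value to measure intervals.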

b) Real Time:

Another view of time accessible through this interface is real
time. Real time always progresses except when the VM is
stopped or suspended. Real time is presented to the guest as a
counter which increments at a constant rate defined (and presented)
by the hypervisor. All the VCPUs of a VM share the same real time
counter.

The unit of the counter is called "cycles". The unit and initial
value (corresponding to the time the VM enters paravirtual mode)
are chosen by the hypervisor so that the real time counter will not
roll over in any practical length of time. It is expected that the
frequency (cycles per second) is chosen such that this clock
provides a "high-resolution" view of time. The unit can only
change when the VM (re)enters paravirtual mode.
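Since the counter runs in hypervisor-defined "cycles", a guest must scale readings by the advertised frequency to get nanoseconds. This is a sketch only: how the guest obtains the frequency is not described in this section, so it is simply passed in as the hypothetical parameter `cycles_per_sec`.

```c
#include <stdint.h>

/* Hypothetical helper: convert a count of real time "cycles" to
 * nanoseconds, given the counter frequency in cycles per second. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t cycles_per_sec)
{
    /* Split into whole seconds and a remainder so that multiplying by
     * 1e9 does not overflow for large counter values. The remainder
     * term stays below 2^64 while cycles_per_sec is under ~18.4 GHz. */
    uint64_t secs = cycles / cycles_per_sec;
    uint64_t rem  = cycles % cycles_per_sec;
    return secs * 1000000000ULL + rem * 1000000000ULL / cycles_per_sec;
}
```

Because the unit can change when the VM (re)enters paravirtual mode, a guest using this approach would need to re-read the frequency at that point rather than caching it forever.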

c) Stolen time and Available time:

A VCPU is always in one of three states: running, halted, or ready.
The VCPU is in the 'running' state if it is executing. When the
VCPU executes the HLT interface, the VCPU enters the 'halted' state
and remains halted until there is some work pending for the VCPU
(e.g. an alarm expires, host I/O completes on behalf of virtual
I/O). At this point, the VCPU enters the 'ready' state (waiting
for the hypervisor to reschedule it). Finally, at any time when
the VCPU is in neither the 'running' nor the 'halted' state, it
is in the 'ready' state.
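The three states and the transitions between them can be summarized as a small state machine. The identifiers below are hypothetical; the proposal names the states and the events but not these symbols.

```c
/* Hypothetical names for the VCPU states described above. */
enum vcpu_state { VCPU_RUNNING, VCPU_HALTED, VCPU_READY };

/* Hypothetical names for the events that move a VCPU between states. */
enum vcpu_event {
    EV_HALT,          /* guest executes the HLT interface */
    EV_WORK_PENDING,  /* e.g. an alarm expires or host I/O completes */
    EV_SCHEDULED,     /* hypervisor runs the VCPU on a PCPU */
    EV_PREEMPTED      /* hypervisor takes the PCPU away */
};

/* Return the state after 'ev'; events that do not apply leave the
 * state unchanged. */
static enum vcpu_state vcpu_next_state(enum vcpu_state s, enum vcpu_event ev)
{
    switch (s) {
    case VCPU_RUNNING:
        if (ev == EV_HALT)
            return VCPU_HALTED;
        if (ev == EV_PREEMPTED)
            return VCPU_READY;
        break;
    case VCPU_HALTED:
        if (ev == EV_WORK_PENDING)
            return VCPU_READY;  /* waits to be rescheduled */
        break;
    case VCPU_READY:
        if (ev == EV_SCHEDULED)
            return VCPU_RUNNING;
        break;
    }
    return s;
}
```

Note that a halted VCPU never goes straight back to 'running'; it always passes through 'ready' first.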

For example, consider the following sequence of events, with times
given in real time:

(Example 1)

At 0 ms, VCPU executing guest code.
At 1 ms, VCPU requests virtual I/O.
At 2 ms, Host performs I/O for the virtual I/O request.
At 3 ms, VCPU executes VMI_Halt.
At 4 ms, Host completes I/O for virtual I/O request.
At 5 ms, VCPU begins executing guest code, vectoring to the interrupt
handler for the device initiating the virtual I/O.
At 6 ms, VCPU preempted by hypervisor.
At 9 ms, VCPU begins executing guest code.

From 0 ms to 3 ms, VCPU is in the 'running' state. At 3 ms, VCPU
enters the 'halted' state and remains in this state until the 4 ms
mark. From 4 ms to 5 ms, the VCPU is in the 'ready' state. At 5
ms, the VCPU re-enters the 'running' state until it is preempted by
the hypervisor at the 6 ms mark. From 6 ms to 9 ms, VCPU is again
in the 'ready' state, and finally 'running' again after 9 ms.

Stolen time is defined per VCPU to progress at the rate of real
time when the VCPU is in the 'ready' state, and does not progress
otherwise. Available time is defined per VCPU to progress at the
rate of real time when the VCPU is in the 'running' or 'halted'
state, and does not progress when the VCPU is in the 'ready'
state.

So, for the above example, the following table indicates these time
values for the VCPU at each ms boundary:

Real time (ms)   Stolen time (ms)   Available time (ms)
       0                 0                   0
       1                 0                   1
       2                 0                   2
       3                 0                   3
       4                 0                   4
       5                 1                   4
       6                 1                   5
       7                 2                   5
       8                 3                   5
       9                 4                   5
      10                 4                   6

Notice that at any point:
real_time == stolen_time + available_time
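The table can be reproduced mechanically by replaying Example 1 one millisecond at a time: 'ready' slices add to stolen time, and 'running' or 'halted' slices add to available time. This sketch hard-codes the example's timeline; the names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

enum vcpu_state { VCPU_RUNNING, VCPU_HALTED, VCPU_READY };

/* State of the VCPU during each 1 ms slice [t, t+1) of Example 1. */
static const enum vcpu_state example1[] = {
    VCPU_RUNNING, VCPU_RUNNING, VCPU_RUNNING, /* 0-3 ms */
    VCPU_HALTED,                              /* 3-4 ms */
    VCPU_READY,                               /* 4-5 ms */
    VCPU_RUNNING,                             /* 5-6 ms */
    VCPU_READY, VCPU_READY, VCPU_READY,       /* 6-9 ms */
    VCPU_RUNNING                              /* 9-10 ms */
};
#define EXAMPLE1_MS (sizeof example1 / sizeof example1[0])

/* Stolen and available time (in ms) at real time 'ms' (clamped to 10). */
static void times_at(size_t ms, uint64_t *stolen, uint64_t *available)
{
    if (ms > EXAMPLE1_MS)
        ms = EXAMPLE1_MS;
    *stolen = 0;
    *available = 0;
    for (size_t t = 0; t < ms; t++) {
        if (example1[t] == VCPU_READY)
            (*stolen)++;      /* waiting for a PCPU */
        else
            (*available)++;   /* running or halted */
    }
}
```

Every slice is counted exactly once as either stolen or available, which is why the invariant above holds at every row.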

Stolen time and available time are also presented as counters in
"cycles" units. The initial value of the stolen time counter is 0.
This implies the initial value of the available time counter is the
same as the real time counter.
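Because the stolen counter starts at 0 and the invariant always holds, a guest that reads the real time counter and its VCPU's stolen time counter can derive available time as their difference, and can split any interval between two readings into its stolen and available parts. The struct and function names here are hypothetical, and the mechanism for reading the counters is not shown.

```c
#include <stdint.h>

/* Hypothetical snapshot of the counters a guest reads, in cycles. */
struct vmi_time_snapshot {
    uint64_t real;    /* real time counter */
    uint64_t stolen;  /* this VCPU's stolen time counter */
};

/* available = real - stolen, from the invariant and the initial values. */
static uint64_t available_cycles(const struct vmi_time_snapshot *s)
{
    return s->real - s->stolen;
}

/* Split the interval between two snapshots into the part taken by the
 * host (stolen) and the part the VCPU actually had (available). */
static void account_interval(const struct vmi_time_snapshot *prev,
                             const struct vmi_time_snapshot *now,
                             uint64_t *stolen_delta, uint64_t *avail_delta)
{
    *stolen_delta = now->stolen - prev->stolen;
    *avail_delta  = available_cycles(now) - available_cycles(prev);
}
```

Using Example 1 with one cycle per millisecond, snapshots taken at 4 ms and 9 ms show 4 ms stolen and 1 ms available over that interval.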

Alarms:

Alarms can be set (armed) against the real time counter or the
available time counter. Alarms can be programmed to expire once
(one-shot) or on a regular period (periodic). They are armed by
indicating an absolute counter value expiry, and in the case of a
periodic alarm, a non-zero relative period counter value. [TBD:
The method of wiring the alarms to an interrupt vector is dependent
upon the virtual interrupt controller portion of the interface.
Currently, the alarms may be wired as if they are attached to IRQ0
or the vector in the local APIC LVTT. This way, the alarms can be
used as drop in replacements for the PIT or local APIC timer.]
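Since alarms are armed with an absolute expiry, a guest that wants an alarm "some time from now" must add the delay to the current counter value itself. The sketch below only models the arming parameters; it assumes that a period of 0 denotes a one-shot alarm (the text says only that a periodic alarm needs a non-zero period), and every name in it is hypothetical rather than part of the VMI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of which counter an alarm is armed against. */
enum vmi_alarm_counter { ALARM_REAL_TIME, ALARM_AVAILABLE_TIME };

/* Hypothetical arming request; all values are in cycles. */
struct vmi_alarm_req {
    enum vmi_alarm_counter counter;
    uint64_t expiry;  /* absolute counter value */
    uint64_t period;  /* relative; 0 = one-shot (assumed convention) */
};

static bool alarm_req_is_periodic(const struct vmi_alarm_req *r)
{
    return r->period != 0;
}

/* Build a request that first expires 'delay' cycles after the counter
 * value 'now' and, if 'period' is non-zero, every 'period' cycles after. */
static struct vmi_alarm_req alarm_req_from_now(enum vmi_alarm_counter c,
                                               uint64_t now, uint64_t delay,
                                               uint64_t period)
{
    struct vmi_alarm_req r = { c, now + delay, period };
    return r;
}
```

As an illustration of the drop-in use mentioned above, a guest emulating a 100 Hz periodic tick might arm a real time alarm whose period is one hundredth of the counter frequency.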

Alarms are per-VCPU mechanisms. An alarm set by VCPU 0 will fire
only on VCPU 0, while an alarm set by VCPU 1 will fire only on
VCPU 1. If an alarm is set relative to available time, its expiry
is a value relative to the available time counter of the VCPU that
set it.

The interface includes a method to cancel (disarm) an alarm. On
each VCPU, one alarm can be set against each of the two counters
(real time and available time). A VCPU in the 'halted' state
becomes 'ready' when the counter of any of its armed alarms reaches
that alarm's expiry.
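Per VCPU, this amounts to two alarm slots, one per counter, plus a wake-up check for the halted state. This is a sketch with hypothetical names; in particular, the rule that arming an already-armed slot replaces the old alarm is an assumption, not something the text states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slot indices: one alarm per counter. */
enum { ALARM_REAL = 0, ALARM_AVAIL = 1, ALARM_NCOUNTERS = 2 };

struct alarm_slot {
    bool armed;
    uint64_t expiry;  /* absolute value of this slot's counter */
};

/* Per-VCPU alarm state. */
struct vcpu_alarms {
    struct alarm_slot slot[ALARM_NCOUNTERS];
};

/* Assumed semantics: re-arming a slot replaces its previous alarm. */
static void alarm_arm(struct vcpu_alarms *v, int counter, uint64_t expiry)
{
    v->slot[counter].armed = true;
    v->slot[counter].expiry = expiry;
}

static void alarm_disarm(struct vcpu_alarms *v, int counter)
{
    v->slot[counter].armed = false;
}

/* True if a halted VCPU should become 'ready': some armed alarm's
 * counter (now[ALARM_REAL] or now[ALARM_AVAIL]) has reached its expiry. */
static bool halted_vcpu_should_wake(const struct vcpu_alarms *v,
                                    const uint64_t now[ALARM_NCOUNTERS])
{
    for (int c = 0; c < ALARM_NCOUNTERS; c++)
        if (v->slot[c].armed && now[c] >= v->slot[c].expiry)
            return true;
    return false;
}
```

Since available time keeps progressing while a VCPU is halted, an alarm armed against available time can wake a halted VCPU just as a real time alarm can.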

An alarm "fires" by signaling the virtual interrupt controller. An
alarm will fire as soon as possible after the counter value is
greater than or equal to the alarm's current expiry. However, an
alarm can fire only when its VCPU is in the 'running' state.

If the alarm is periodic, a sequence of expiry values,

E(i) = e0 + p * i,   i = 0, 1, 2, 3, ...

where 'e0' is the expiry specified when setting the alarm and 'p'
is the period of the alarm, is used to arm the alarm. Initially,
E(0) is used as the expiry. When the alarm fires, the next expiry
value in the sequence that is greater than the current value of the
counter is used as the alarm's new expiry.

One-shot alarms have only one expiry. When a one-shot alarm fires,
it is automatically disarmed.
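The rearming rule above can be computed directly instead of stepping through the sequence: if the alarm fires at counter value c with current expiry e (so e <= c), the next expiry is e + (floor((c - e) / p) + 1) * p, which is the smallest member of the sequence strictly greater than c. This sketch uses hypothetical names and assumes a period of 0 marks a one-shot alarm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical alarm record; values are in counter cycles. */
struct alarm {
    bool armed;
    uint64_t expiry;  /* current expiry, some E(i) */
    uint64_t period;  /* p; 0 means one-shot (assumed encoding) */
};

/* Smallest expiry + k*period (k >= 1) that is strictly greater than
 * 'counter', given expiry <= counter. */
static uint64_t next_periodic_expiry(uint64_t expiry, uint64_t period,
                                     uint64_t counter)
{
    uint64_t missed = (counter - expiry) / period; /* expiries already passed */
    return expiry + (missed + 1) * period;
}

/* Update the alarm when it fires at counter value 'counter'. */
static void alarm_on_fire(struct alarm *a, uint64_t counter)
{
    if (a->period == 0) {
        a->armed = false;  /* one-shot: disarmed automatically */
        return;
    }
    a->expiry = next_periodic_expiry(a->expiry, a->period, counter);
}
```

With e0 = 3 and p = 2, a fire at 3 rearms the alarm for 5, while a late fire at 8 skips directly to 9, so expiries that passed while the alarm could not fire are not delivered later.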

Suppose an alarm is set relative to real time with expiry at the 3
ms mark and a period of 2 ms. It will expire at these real time
marks: 3, 5, 7, 9, and so on. Note that even if the alarm does not
fire during the 5 ms to 7 ms interval (for example, because its
VCPU is not running), the missed expiry is not queued: the alarm
can fire at most once during the 7 ms to 9 ms interval (unless, of
course, it is reprogrammed).

If an alarm is set relative to available time with expiry at the 1
ms mark (in available time) and with a period of 2 ms, then it will
expire on these available time marks: 1, 3, 5. In the scenario
described in Example 1, those available time values correspond to
these values in real time: 1, 3, 6.
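The mapping from available time back to real time can be checked by replaying Example 1 and returning the first real time mark at which available time reaches a given value. This sketch hard-codes the example's timeline; the names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

enum vcpu_state { VCPU_RUNNING, VCPU_HALTED, VCPU_READY };

/* State of the VCPU during each 1 ms slice [t, t+1) of Example 1. */
static const enum vcpu_state example1[] = {
    VCPU_RUNNING, VCPU_RUNNING, VCPU_RUNNING, VCPU_HALTED, VCPU_READY,
    VCPU_RUNNING, VCPU_READY, VCPU_READY, VCPU_READY, VCPU_RUNNING
};
#define EXAMPLE1_MS (sizeof example1 / sizeof example1[0])

/* First real time mark (ms) at which available time equals 'avail',
 * or -1 if that happens after the end of the example (10 ms). */
static long real_time_for_available(uint64_t avail)
{
    uint64_t a = 0;
    if (avail == 0)
        return 0;
    for (size_t t = 0; t < EXAMPLE1_MS; t++) {
        if (example1[t] != VCPU_READY)
            a++;               /* running or halted: available time advances */
        if (a == avail)
            return (long)(t + 1);
    }
    return -1;
}
```

The available time mark 5 maps to 6 ms rather than 5 ms because the VCPU spends the 4 ms to 5 ms slice in the 'ready' state.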
