Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

From: Jeremy Fitzhardinge
Date: Wed Mar 07 2007 - 16:07:53 EST


Thomas Gleixner wrote:
> I tend to disagree. The clockevents infrastructure was designed to cope
> with the existing mess of real hardware. The discussion over the last
> days exposed me to even more exotic designs than the hardware vendors
> were able to deliver until now.
>

It's a different but related problem domain. It's also an increasingly
common execution environment for a kernel to find itself in. Dealing
with proper paravirtualized timer devices is a big improvement over
trying to reliably deal with fully virtualized hardware timers, which
simply can't make the same guarantees that real hardware can make - such
as "you will definitely get N ns of CPU time between doing the
delta->absolute computation and programming the match register".

> I know exactly where you are heading:
>
> Offload the handling of hypervisor design decisions to the kernel and
> let us deal with that. So we need to implement 128 bit math to convert
> back and forth and I expect more interesting things to creep up.
>

I wouldn't put it that way. We've been getting a lot of pressure to
keep the pv_ops interface as small as possible. Reusing existing kernel
interfaces rather than making up new ones is a good way to do that. The
clock infrastructure certainly cleans things up; earlier Xen patches
made a complete copy of the old kernel/time.c and hacked it around,
which isn't what anyone wants to do.

> All this is of _NO_ use and benefit for the kernel itself.
>

Lots of people want to run Linux in virtual machines. If we can make
sane kernel changes to help those users, then that is of use and benefit
to the kernel.

> Real hardware copes well with relative deltas for the events, even when
> it is match register based. I thought long about the support for
> absolute expiry values in cycles and decided against them to avoid that
> math hackery, which you folks now demand.
>

Not really. Xen and VMI interfaces both use absolute monotonic time for
timeouts, which is certainly a common case for such interfaces
(pthread_cond_timedwait, for example). Converting delta to absolute is
clearly simple, but it does introduce an added bit of non-determinism if
your CPU can be preempted from outside at any time. I presume SMM or
similar interrupts can cause the same problem on real hardware.

I guess the worst case for real hardware is an absolute-time match
register which only compares for match==now rather than match<=now,
since you could completely lose the time event if you miss the deadline.
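
To make the lost-event case concrete, here is a rough userspace sketch.
Everything in it (struct fake_timer, program_delta, the "stolen" time
parameter) is made up for illustration and does not correspond to any
real driver or hypervisor interface. It models a delta being converted
to an absolute expiry, the CPU being preempted before the match register
is written, and the two possible compare rules:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a timer with an absolute match register. */
struct fake_timer {
	uint64_t now;	/* free-running counter, in ns */
	uint64_t match;	/* programmed absolute expiry, in ns */
};

/*
 * Convert a relative delta to an absolute expiry and program it.
 * stolen_ns models time lost (preemption, SMM, ...) between reading
 * the clock and writing the match register.
 */
void program_delta(struct fake_timer *t, uint64_t delta_ns, uint64_t stolen_ns)
{
	uint64_t expiry = t->now + delta_ns;	/* delta -> absolute */

	t->now += stolen_ns;			/* we were preempted here */
	t->match = expiry;			/* may already be in the past */
}

/* Compare rule that only fires on exact equality. */
bool fires_eq(const struct fake_timer *t)
{
	return t->match == t->now;
}

/* Compare rule that fires once the deadline has been reached or passed. */
bool fires_le(const struct fake_timer *t)
{
	return t->match <= t->now;
}

/* Tick the counter 1 ns at a time; report whether the event ever fires. */
bool event_seen(struct fake_timer t, uint64_t horizon_ns,
		bool (*fires)(const struct fake_timer *))
{
	for (uint64_t i = 0; i <= horizon_ns; i++) {
		if (fires(&t))
			return true;
		t.now++;
	}
	return false;
}
```

With a 100ns delta and 500ns stolen, the match<=now rule fires
immediately and the match==now rule never fires within the horizon.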

>> static const struct clock_event_device xen_clockevent = {
>> .name = "xen",
>> .features = CLOCK_EVT_FEAT_ONESHOT,
>>
>> .max_delta_ns = 0x7fffffff,
>> .min_delta_ns = 100, /* ? */
>>
>> .mult = 1<<XEN_SHIFT,
>> .shift = XEN_SHIFT,
>>
>
> We can optimize this by skipping the conversion via a feature flag.
>

The clocksource needed the shift for ntp warping. Does the clockevent
need a shift at all? Could I just set mult/shift to 1/0?
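
For reference, a small sketch of why I'd expect the two settings to be
equivalent. It assumes the core converts a ns delta to device units with
a plain multiply-then-shift; ns_to_ticks and the shift value of 22 in
the usage below are placeholders, not the real kernel function or the
real XEN_SHIFT:

```c
#include <stdint.h>

/*
 * Scaled conversion from nanoseconds to device ticks:
 *	ticks = (ns * mult) >> shift
 * Assumes ns * mult does not overflow 64 bits.
 */
uint64_t ns_to_ticks(uint64_t delta_ns, uint32_t mult, uint32_t shift)
{
	return (delta_ns * mult) >> shift;
}
```

Under that assumption ns_to_ticks(d, 1 << 22, 22) and ns_to_ticks(d, 1, 0)
both return d for any d up to max_delta_ns, so for a device that already
counts in ns the shift only costs a multiply and some headroom.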

> Your implementation is almost the perfect prototype, if you move the
> 128 bit hackery into the hypervisor and hide it away from the kernel :)
>

The point is to use the tsc to avoid making any hypercalls, so dealing
with the tsc->ns conversion has to happen on the guest side somehow.
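
Roughly, the guest-side conversion looks like the sketch below. The
struct layout and names (time_snapshot, mul_frac, pre_shift) are
invented for illustration rather than copied from any hypervisor ABI,
and unsigned __int128 is a GCC/Clang extension standing in for the wide
multiply:

```c
#include <stdint.h>

/* Hypothetical time snapshot published by the hypervisor. */
struct time_snapshot {
	uint64_t tsc_at_snapshot;	/* TSC value when the snapshot was taken */
	uint64_t ns_at_snapshot;	/* system time in ns at that instant */
	uint32_t mul_frac;		/* ns per tick as a 0.32 fixed-point fraction */
	int8_t   pre_shift;		/* power-of-two pre-scale for the tsc delta */
};

/* Scale a tsc delta to ns: ((delta << pre_shift) * mul_frac) >> 32. */
uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int8_t pre_shift)
{
	if (pre_shift < 0)
		delta >>= -pre_shift;
	else
		delta <<= pre_shift;

	/* 64 x 32 bit multiply needs a 96-bit intermediate. */
	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

/* Current time in ns, computed without any call into the hypervisor. */
uint64_t snapshot_to_ns(const struct time_snapshot *s, uint64_t tsc_now)
{
	return s->ns_at_snapshot +
	       scale_delta(tsc_now - s->tsc_at_snapshot, s->mul_frac, s->pre_shift);
}
```

For a 2GHz tsc, mul_frac would be 0x80000000 (0.5 ns per tick) with
pre_shift 0, so a delta of 2000 ticks scales to 1000ns.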

> One of these is perfectly fine for _ALL_ of the hypervisor folks.
> Anything else is just a backwards decision for the kernel.
>

That would certainly be ideal. We'll look at the xen, vmi, lguest and
kvm paravirtualized time models and see how much they really have in
common. I'm a bit curious about how vmi's time events make their way
back into the system.

J