Re: [RFC PATCH 0/3] generic hypercall support

From: Gregory Haskins
Date: Mon May 11 2009 - 09:15:03 EST


Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>
>>> I'm surprised so much effort is going into this, is there any
>>> indication that this is even close to a bottleneck in any circumstance?
>>>
>>
>> Yes. Each 1us of overhead is a 4% regression in something as trivial as
>> a 25us UDP/ICMP rtt "ping".
>
> It wasn't 1us, it was 350ns or something around there (i.e ~1%).

I wasn't referring to "it". I chose my words carefully.

Let me rephrase for your clarity: *each* 1us of overhead introduced into
the signaling path is a ~4% latency regression for a round trip on a
high speed network (note that this can also affect throughput at some
level, too). I believe this point has been lost on you from the very
beginning of the vbus discussions.

I specifically generalized my statement above because, #1, I assume
everyone here is smart enough to convert that nice round unit into the
relevant figure, and, #2, there are multiple potential latency sources
at play which we need to factor in when looking at the big picture. For
instance, the difference between a PF exit and an IO exit (2.58us on
x86, to be precise). Or whether you need to take a heavy-weight exit.
Or a context switch to qemu, then to the kernel, back to qemu, and back
to the vcpu. Or acquire a mutex. Or get head-of-line blocked on the VGA
model's IO. I know you wish that this whole discussion would just go
away, but these little "300ns here, 1600ns there" costs really add up
in aggregate, despite your dismissive attitude towards them. And it
doesn't take much to affect the results in a measurable way. As stated,
each 1us costs ~4%. My motivation is to reduce as many of these sources
as possible.
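
To make the arithmetic concrete, here is a trivial illustration (purely
illustrative C; the 25us rtt and the 2.58us PIO-vs-PF delta are the
figures quoted above):

    #include <stdio.h>

    int main(void)
    {
        const double rtt_us = 25.0;                  /* baseline rtt from the example above */
        const double overhead_us[] = { 1.0, 2.58 };  /* 1us example; PIO-vs-PF exit delta   */
        int i;

        for (i = 0; i < 2; i++)
            printf("%.2fus overhead -> %.1f%% latency regression\n",
                   overhead_us[i], 100.0 * overhead_us[i] / rtt_us);

        return 0;   /* prints ~4.0% and ~10.3% */
    }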

So, yes, the delta from PIO to HC is 350ns. Yes, this is a ~1.4%
improvement. So what? It's still an improvement. If that improvement
were free, would you object? And we all know that this change isn't
"free" because we have to change some code (+128/-0, to be exact). But
what is it, specifically, that you are objecting to in the first place?
Adding hypercall support as a pv_ops primitive isn't exactly hard or
complex, or even very much code.
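
Roughly speaking, the shape of such a primitive would be something like
the following. This is a minimal sketch of the idea only; the names and
signature are illustrative and are not taken from the actual RFC patch:

    #include <linux/errno.h>

    /*
     * Illustrative pv_ops-style hook: bare metal gets a stub that fails,
     * and a guest (e.g. KVM) patches in something that issues its native
     * hypercall instruction (VMCALL/VMMCALL).
     */
    struct pv_hypercall_ops {
            long (*hypercall)(unsigned long nr,
                              unsigned long a0, unsigned long a1,
                              unsigned long a2, unsigned long a3);
    };

    static long native_hypercall(unsigned long nr,
                                 unsigned long a0, unsigned long a1,
                                 unsigned long a2, unsigned long a3)
    {
            return -ENOSYS;         /* no hypervisor underneath us */
    }

    static struct pv_hypercall_ops pv_hypercall_ops = {
            .hypercall = native_hypercall,
    };

    /*
     * Callers would then do something like:
     *         ret = pv_hypercall_ops.hypercall(nr, a0, a1, a2, a3);
     * and a KVM guest would wire .hypercall up to kvm_hypercall4().
     */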

Besides, I've already stated multiple times (including in this very
thread) that I am not yet sure whether the 350ns/1.4% improvement alone
is enough to justify a change. So if you are somehow
trying to make me feel silly by pointing out the "~1%" above, you are
being ridiculous.

Rather, I was simply answering your question as to whether these latency
sources are a real issue. The answer is "yes" (assuming you care about
latency) and I gave you a specific example and a method to quantify the
impact.

It is duly noted that you do not care about this type of performance,
but you also need to realize that your "blessing" or
acknowledgment/denial of the problem domain has _zero_ bearing on
whether the domain exists, or if there are others out there that do care
about it. Sorry.

>
>> for request-response, this is generally for *every* packet since you
>> cannot exploit buffering/deferring.
>>
>> Can you back up your claim that PPC has no difference in performance
>> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
>> like instructions, but clearly there are ways to cause a trap, so
>> presumably we can measure the difference between a PF exit and something
>> more explicit).
>>
>
> First, the PPC that KVM supports performs very poorly relatively
> speaking because it receives no hardware assistance

So wouldn't that be making the case that it could use as much help as
possible?

> this is not the right place to focus wrt optimizations.

Odd choice of words. I am advocating the opposite: a broad solution for
many arches and many platforms (i.e. hypervisors), and therefore I am
not "focused" on it (or really on any one arch) at all, per se. I am
_worried_, however, that we could be overlooking PPC (as an example) if
we ignore the disparity between MMIO and HC, since other
higher-performance options like PIO are not available there. The goal
on this particular thread is to come up with an IO interface that works
reasonably well across as many hypervisors as possible. MMIO/PIO do not
appear to fit that bill (at least not without tunneling them over HCs).

If I am guilty of focusing anywhere too much it would be x86 since that
is the only development platform I have readily available.


>
>
> And because there's no hardware assistance, there simply isn't a
> hypercall instruction. Are PFs the fastest type of exits? Probably
> not but I honestly have no idea. I'm sure Hollis does though.
>
> Page faults are going to have tremendously different performance
> characteristics on PPC too because it's a software managed TLB.
> There's no page table lookup like there is on x86.

The difference between MMIO and "HC", and whether it is cause for
concern, will continue to be pure speculation until we can find someone
with a PPC box willing to run some numbers. I will point out that we
both seem to theorize that PFs will perform worse than the
alternatives, so it would seem you are actually making my point for me.

>
> As a more general observation, we need numbers to justify an
> optimization, not to justify not including an optimization.
>
> In other words, the burden is on you to present a scenario where this
> optimization would result in a measurable improvement in a real world
> work load.

I have already done this. You seem to have chosen to ignore my
statements and results, but if you insist on rehashing:

I started this project by analyzing system traces and finding some of
the various bottlenecks in comparison to a native host. Throughput was
already pretty decent, but latency was pretty bad (and recently got
*really* bad, but I know you already have a handle on what's causing
that). I digress... One of the conclusions of the research was that I
wanted to focus on building an IO subsystem designed to minimize the
quantity of exits, minimize the cost of each exit, and shorten the
end-to-end signaling path to achieve optimal performance. I also wanted
to build a system that was extensible enough to work with a variety of
client types, on a variety of architectures, etc, so we would only need
to solve these problems "once". The end result was vbus, and the first
working example was venet. The measured performance data of this work
was as follows:

802.x network, 9000-byte MTU, two 8-core x86_64 boxes connected back to
back with Chelsio T3 10GE via crossover.

Bare metal       : tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net (PCI) : tput = 4578Mb/s, round-trip =   249pps (4016us rtt)
Venet (VBUS)     : tput = 5802Mb/s, round-trip = 15127pps (66us rtt)

For more details: http://lkml.org/lkml/2009/4/21/408

You can download this today and run it, review it, compare it. Whatever
you want.

As part of that work, I measured IO performance in KVM and found HCs to
be the superior performer. You can find these results here:
http://developer.novell.com/wiki/index.php/WhyHypercalls. Without
having access to platforms other than x86, but with an understanding of
computer architecture, I speculate that the difference should be even
more pronounced everywhere else that lacks a PIO primitive. And even on
the platform that should yield the least benefit (x86), the gain
(~1.4%) is not huge, but it's not zero either. Therefore, my data and
findings suggest that this is not a bad optimization to consider IMO.
My final results above do not indicate to me that I was completely wrong
in my analysis.
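
For anyone who wants to reproduce that comparison on their own
hardware, the measurement boils down to something like the following
guest-side sketch. This is not the actual benchmark behind the numbers
above; the port and iteration constants are arbitrary, and you would
repeat the loop with a hypercall (e.g. kvm_hypercall0()) to compare HC
exit cost:

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/io.h>
    #include <asm/timex.h>

    #define ITERS           100000
    #define DUMMY_PORT      0xf0    /* any port the hypervisor traps; arbitrary */

    static int __init exitbench_init(void)
    {
            cycles_t t0, t1;
            int i;

            /* Time a batch of forced PIO exits with the cycle counter. */
            t0 = get_cycles();
            for (i = 0; i < ITERS; i++)
                    outl(0, DUMMY_PORT);
            t1 = get_cycles();

            pr_info("PIO exit: %llu cycles per round trip\n",
                    (unsigned long long)(t1 - t0) / ITERS);
            return 0;
    }

    static void __exit exitbench_exit(void)
    {
    }

    module_init(exitbench_init);
    module_exit(exitbench_exit);
    MODULE_LICENSE("GPL");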

Now I know you have been quick in the past to dismiss my efforts, and to
claim you can get the same results without needing the various tricks
and optimizations I uncovered. But quite frankly, until you post some
patches for community review and comparison (as I have done), it's just
meaningless talk.

Perhaps you are truly unimpressed with my results and will continue to
insist that my work, including my final results, is "virtually
meaningless". Or perhaps you have an agenda. You can keep working
against me and try to block anything I suggest by coming up with what
appears to be any excuse you can find, making rude replies on email
threads and snide comments on IRC, etc. It's simply not necessary.

Alternatively, you can work _with_ me to help try to improve KVM and
Linux (e.g. I still need someone to implement a virtio-net backend, and
who knows it better than you). The choice is yours. But let's cut the
BS, because it's counterproductive and, frankly, getting old.

Regards,
-Greg


