Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

From: Michael S. Tsirkin
Date: Tue Feb 19 2019 - 09:40:27 EST


On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
> On 19.02.19 00:47, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 18.02.19 18:31, Alexander Duyck wrote:
> >>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> >>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >>>>>> It would be worth a try. My feeling is that a synchronous report after
> >>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >>>>>> s390x. (basically always enabled, nobody complains).
> >>>>>
> >>>>> What slips under the radar on an arch like s390 might
> >>>>> raise issues for a popular arch like x86. My fear would be
> >>>>> if it's only a problem e.g. for realtime. Then you get
> >>>>> a condition that's very hard to trigger and affects
> >>>>> worst case latencies.
> >>>>
> >>>> Realtime should never use free page hinting. Just like it should never
> >>>> use ballooning. Just like it should pin all pages in the hypervisor.
> >>>>
> >>>>>
> >>>>> But really what business has something that is supposedly
> >>>>> an optimization blocking a VCPU? We are just freeing up
> >>>>> lots of memory why is it a good idea to slow that
> >>>>> process down?
> >>>>
> >>>> I first want to know that it is a problem before we declare it a
> >>>> problem. I provided an example (s390x) where it does not seem to be a
> >>>> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>>>
> >>>> No trying to deny that it could be a problem on x86, but then I assume
> >>>> it is only a problem in specific setups.
> >>>>
> >>>> I would much rather prefer a simple solution that can eventually be
> >>>> disabled in selected setup than a complicated solution that tries to fit
> >>>> all possible setups. Realtime is one of the examples where such stuff is
> >>>> to be disabled either way.
> >>>>
> >>>> Optimization of space comes with a price (here: execution time).
> >>>
> >>> One thing to keep in mind though is that if you are already having to
> >>> pull pages in and out of swap on the host in order be able to provide
> >>> enough memory for the guests the free page hinting should be a
> >>> significant win in terms of performance.
> >>
> >> Indeed. And also we are in a virtualized environment already, we can
> >> have any kind of sudden hickups. (again, realtime has special
> >> requirements on the setup)
> >>
> >> Side note: I like your approach because it is simple. I don't like your
> >> approach because it cannot deal with fragmented memory. And that can
> >> happen easily.
> >>
> >> The idea I described here can be similarly be an extension of your
> >> approach, merging in a "batched reporting" Nitesh proposed, so we can
> >> report on something < MAX_ORDER, similar to s390x. In the end it boils
> >> down to reporting via hypercall vs. reporting via virtio. The main point
> >> is that it is synchronous and batched. (and that we properly take care
> >> of the race between host freeing and guest allocation)
> >
> > I'd say the discussion is even simpler then that. My concern is more
> > synchronous versus asynchronous. I honestly think the cost for a
> > synchronous call is being overblown and we are likely to see the fault
> > and zeroing of pages cost more than the hypercall or virtio
> > transaction itself.
>
> The overhead of page faults and zeroing should be mitigated by
> MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
> (context switch) becomes relevant.
>
> We have various discussions now :) And I think they are related.
>
> synchronous versus asynchronous
> batched vs. non-batched
> MAX_ORDER - 1 vs. other/none magic number
>
> 1. synchronous call without batching on every kfree is bad. The
> interface is fixed to big magic numbers, otherwise we end up having a
> hypercall on every kfree. This is your approach.
>
> 2. asynchronous calls without batching would most probably have similar
> problems with small granularities as we had when ballooning without
> batching. Just overhead we can avoid.
>
> 3. synchronous and batched is what s390x does. It can deal with page
> granularity. It is what I initially described in this sub-thread.
>
> 4. asynchronous and batched. This is the other approach we discussed
> yesterday. If we can get it implemented, I would be interested in
> performance numbers.
>
> As far as I understood, Michael seems to favor something like 4 (and I
> assume eventually 2 if it is similarly fast). I am a friend of either 3
> or 4.

Well Linus said big granularity is important for linux MM
and not to bother with hinting small sizes.

Alex said cost of a hypercall is drawfed by a pagefault
after alloc. I would be curious whether async pagefault
can help things somehow though.

> >
> > Also one reason why I am not a fan of working with anything less than
> > PMD order is because there have been issues in the past with false
> > memory leaks being created when hints were provided on THP pages that
> > essentially fragmented them. I guess hugepaged went through and
> > started trying to reassemble the huge pages and as a result there have
> > been apps that ended up consuming more memory than they would have
> > otherwise since they were using fragments of THP pages after doing an
> > MADV_DONTNEED on sections of the page.
>
> I understand your concerns, but we should not let bugs in the hypervisor
> dictate the design. Bugs are there to be fixed. Interesting read,
> though, thanks!

Right but if we break up a huge page we are then creating
more work for hypervisor to reassemble it.

> --
>
> Thanks,
>
> David / dhildenb