Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

From: James Houghton
Date: Mon Jan 30 2023 - 13:40:23 EST

Next message: Sean Christopherson: "Re: [PATCH v2 03/11] KVM: nSVM: explicitly raise KVM_REQ_EVENT on nested VM exit if L1 doesn't intercept interrupts"
Previous message: Boqun Feng: "Re: [PATCH] locking/atomic: atomic: Use arch_atomic_{read,set} in generic atomic ops"
In reply to: Peter Xu: "Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range"
Next in thread: Peter Xu: "Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > James,
> > >
> > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > It turns out that the THP-like scheme significantly slows down
> > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > >
> > > > This is only really a problem because this is done between
> > > > mmu_notifier_invalidate_range_start() and
> > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > access any of the 1G page while we're doing this (and it can take like
> > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > >
> > > Did you try to measure the time, or it's a quick observation from perf?
> >
> > I put some ktime_get()s in.
> >
> > >
> > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > But maybe it depends on many things.
> > >
> > > I'm curious how the 1sec is provisioned between the procedures. E.g., I
> > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > too as it should walk the smally mapped EPT pgtables.
> >
> > Somehow this doesn't take all that long (only like 10-30ms when
> > collapsing from 4K -> 1G) compared to hugetlb_collapse().
>
> Did you populate as much the EPT pgtable when measuring this?
>
> IIUC this number should be pretty much relevant to how many pages are
> shadowed to the kvm pgtables. If the EPT table is mostly empty it should
> be super fast, but OTOH it can be much slower if when it's populated,
> because tdp mmu should need to handle the pgtable leaves one by one.
>
> E.g. it should be fully populated if you have a program busy dirtying most
> of the guest pages during test migration.

That's what I was doing. I was running a workload in the guest that
just writes 8 bytes to a page and jumps ahead a few pages on all
vCPUs, touching most of its memory.

But there is more to understand; I'll collect more results. I'm not
sure why the EPT can be unmapped/collapsed so quickly.

>
> Write op should be the worst here case since it'll require the atomic op
> being applied; see kvm_tdp_mmu_write_spte().
>
> >
> > >
> > > Since we'll still keep the intermediate levels around - from application
> > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > so potentially for a very large page we can start with building 2M layers.
> > > But then collapse will need to be run at least two rounds.
> >
> > That's exactly what I thought to do. :) I realized, too, that this is
> > actually how userspace *should* collapse things to avoid holding up
> > vCPUs too long. I think this is a good reason to keep intermediate
> > page sizes.
> >
> > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > huge difference: the THP-like scheme is about 30% slower overall.
> >
> > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > decrementing and then re-incrementing subpage->_mapcount, and then
> > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > decrement the compound_mapcount 512 times (once per PTE), then
> > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > mapcount again by 512 (once per PMD), incrementing it once.
>
> Did you have quantified numbers (with your ktime treak) to compare these?
> If we want to go the other route, I think these will be materials to
> justify any other approach on mapcount handling.

Ok, I can do that. GIve me a couple days to collect more results and
organize them in a helpful way.

(If it's helpful at all, here are some results I collected last week:
[2]. Please ignore it if it's not helpful.)

>
> >
> > The mapcount decrements are about on par with how long it takes to do
> > other things, like updating page tables. The main problem is, with the
> > THP-like scheme (implemented like this [1]), there isn't a way to
> > avoid the 262k decrements when collapsing 1G. So if we want
> > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > then I think something more clever needs to be implemented.
> >
> > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
>
> I believe the whole goal of HGM is trying to face the same challenge if
> we'll allow 1G THP exist and being able to split for anon.
>
> I don't remember whether we discussed below, maybe we did? Anyway...
>
> Another way to not use thp mapcount, nor break smaps and similar calls to
> page_mapcount() on small page, is to only increase the hpage mapcount only
> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> entry is removed (for leaf, it's the same as for now; for HGM, it's when
> freeing pgtable of the PUD entry).

Right, and this is doable. Also it seems like this is pretty close to
the direction Matthew Wilcox wants to go with THPs.

Something I noticed though, from the implementation of
folio_referenced()/folio_referenced_one(), is that folio_mapcount()
ought to report the total number of PTEs that are pointing on the page
(or the number of times page_vma_mapped_walk returns true). FWIW,
folio_referenced() is never called for hugetlb folios.

>
> Again, in all cases I think some solid measurements would definitely be
> helpful (as commented above) to see how much overhead will there be and
> whether that'll start to become a problem at least for the current
> motivations of the whole HGM idea.
>
> Thanks,
>
> --
> Peter Xu
>

Thanks, Peter!

[2]: https://pastebin.com/raw/DVfNFi2m

- James

Next message: Sean Christopherson: "Re: [PATCH v2 03/11] KVM: nSVM: explicitly raise KVM_REQ_EVENT on nested VM exit if L1 doesn't intercept interrupts"
Previous message: Boqun Feng: "Re: [PATCH] locking/atomic: atomic: Use arch_atomic_{read,set} in generic atomic ops"
In reply to: Peter Xu: "Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range"
Next in thread: Peter Xu: "Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]