Re: [PATCH v5 2/2] kvm: Use huge pages for DAX-backed files

From: Barret Rhoden
Date: Thu Dec 12 2019 - 14:55:10 EST


Hi -

On 12/12/19 1:49 PM, Liran Alon wrote:


On 12 Dec 2019, at 20:47, Liran Alon <liran.alon@xxxxxxxxxx> wrote:



On 12 Dec 2019, at 20:22, Barret Rhoden <brho@xxxxxxxxxx> wrote:

This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

This change isnât only relevant for TDP. It also affects when KVM use shadow-paging.
See how FNAME(page_fault)() calls transparent_hugepage_adjust().

Cool, I'll drop references to the EPT/TDP from the commit message.

DAX pages are not PageTransCompound. The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not.

I would rephrase âThe existing check is trying to determine if the pfn
is mapped as part of a transparent huge-pageâ.

Can do.


For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.

This is not related to hugetlbfs but rather THP.

I thought that PageTransCompound also returned true for hugetlbfs (based off of comments in page-flags.h). Though I do see the comment about the 'level == PT_PAGE_TABLE_LEVEL' check excluding hugetlbfs pages.

Anyway, I'll remove the "e.g. hugetlbfs" from the commit message.


For DAX, we can check the page table itself.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock. We grabbed that lock in
kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.

Signed-off-by: Barret Rhoden <brho@xxxxxxxxxx>

I donât think the right place to change for this functionality is transparent_hugepage_adjust()
which is meant to handle PFNs that are mapped as part of a transparent huge-page.

For example, this would prevent mapping DAX-backed file page as 1GB.
As transparent_hugepage_adjust() only handles the case (level == PT_PAGE_TABLE_LEVEL).

As you are parsing the page-tables to discover the page-size the PFN is mapped in,
I think you should instead modify kvm_host_page_size() to parse page-tables instead
of rely on vma_kernel_pagesize() (Which relies on vma->vm_ops->pagesize()) in case
of is_zone_device_page().
The main complication though of doing this is that at this point you donât yet have the PFN
that is retrieved by try_async_pf(). So maybe you should consider modifying the order of calls
in tdp_page_fault() & FNAME(page_fault)().

-Liran

Or alternatively when thinking about it more, maybe just rename transparent_hugepage_adjust()
to not be specific to THP and better handle the case of parsing page-tables changing mapping-level to 1GB.
That is probably easier and more elegant.

I can rename it to hugepage_adjust(), since it's not just THP anymore.

I was a little hesitant to change the this to handle 1 GB pages with this patchset at first. I didn't want to break the non-DAX case stuff by doing so.

Specifically, can a THP page be 1 GB, and if so, how can you tell? If you can't tell easily, I could walk the page table for all cases, instead of just zone_device().

I'd also have to drop the "level == PT_PAGE_TABLE_LEVEL" check, I think, which would open this up to hugetlbfs pages (based on the comments). Is there any reason why that would be a bad idea?

Thanks,

Barret