Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

From: Yu Zhao
Date: Thu Feb 23 2023 - 13:09:22 EST


On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Feb 16, 2023, Yu Zhao wrote:
> > An existing selftest can quickly demonstrate the effectiveness of this
> > patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM:
>
> Not my area of maintenance, but a non-existent changelog (for all intents and
> purposes) for a change of this size and complexity is not acceptable.

Will fix.

> > $ sudo max_guest_memory_test -c 64 -m 250 -s 250
> >
> >   MGLRU      run2
> >   ---------------
> >   Before    ~600s
> >   After      ~50s
> >   Off       ~250s
> >
> >   kswapd (MGLRU before)
> >     100.00%  balance_pgdat
> >       100.00%  shrink_node
> >         100.00%  shrink_one
> >           99.97%  try_to_shrink_lruvec
> >             99.06%  evict_folios
> >               97.41%  shrink_folio_list
> >                 31.33%  folio_referenced
> >                   31.06%  rmap_walk_file
> >                     30.89%  folio_referenced_one
> >                       20.83%  __mmu_notifier_clear_flush_young
> >                         20.54%  kvm_mmu_notifier_clear_flush_young
> > =>                        19.34%  _raw_write_lock
> >
> >   kswapd (MGLRU after)
> >     100.00%  balance_pgdat
> >       100.00%  shrink_node
> >         100.00%  shrink_one
> >           99.97%  try_to_shrink_lruvec
> >             99.51%  evict_folios
> >               71.70%  shrink_folio_list
> >                 7.08%  folio_referenced
> >                   6.78%  rmap_walk_file
> >                     6.72%  folio_referenced_one
> >                       5.60%  lru_gen_look_around
> > =>                      1.53%  __mmu_notifier_test_clear_young
>
> Do you happen to know how much of the improvement is due to batching, and how
> much is due to using a walkless walk?

No. I have three benchmarks running at the moment:
1. Windows SQL server guest on x86 host,
2. Apache Spark guest on arm64 host, and
3. Memcached guest on ppc64 host.

If you are really interested in that breakdown, I can reprioritize -- I
would need to stop 1) and use that machine to get the numbers for you.

> > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> > if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> > caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> >
> > + if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > + caps |= BIT(LRU_GEN_SPTE_WALK);
>
> As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> if we want to avoid batching when there are no mmu_notifier listeners, probe
> mmu_notifiers. But don't call into KVM directly.

I'm not sure I fully understand. Let me state the problem from the MM
side: assuming KVM supports lockless walks, batching can still be
worse (though that is very unlikely), because GFNs can exhibit no
memory locality at all. So this option lets userspace disable batching.
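
To make that concrete, here is a rough sketch of the gating I have in
mind (not the exact code in this series; should_walk_secondary_mmu() is
just an illustrative name):

  /*
   * Illustrative sketch: batch-walk the secondary MMU only when the
   * arch supports a lockless walk and userspace has left
   * LRU_GEN_SPTE_WALK enabled (it can be cleared at runtime through
   * /sys/kernel/mm/lru_gen/enabled).
   */
  static bool should_walk_secondary_mmu(void)
  {
          if (!kvm_arch_has_test_clear_young())
                  return false;

          return get_cap(LRU_GEN_SPTE_WALK);
  }

With GFNs that have no locality, userspace can clear that cap and fall
back to the non-batched path.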

I fully understand why you don't want MM to call into KVM directly. Is
there an acceptable way to set up a clean interface between MM and KVM
other than the MMU notifier?
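
If the concern is purely about the direct call, a probe-based
alternative along the lines you describe might look roughly like this
(sketch only; should_batch_spte_walk() is a made-up name, and it
assumes keying off mmu_notifier registration is acceptable):

  /*
   * Sketch: skip the batched walk when the mm has no mmu_notifier
   * listeners, without asking KVM (or any other listener) directly.
   */
  static bool should_batch_spte_walk(struct mm_struct *mm)
  {
          if (!get_cap(LRU_GEN_SPTE_WALK))
                  return false;

          return mm_has_notifiers(mm);
  }

The downside is that this cannot tell whether the listener actually
supports a lockless walk, which is what the kvm_arch_has_test_clear_young()
check is meant to reflect.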