Re: [syzbot] WARNING: kmalloc bug in memslot_rmap_alloc

From: Sean Christopherson
Date: Tue Sep 07 2021 - 13:30:11 EST


+Linus and Ben

On Sun, Sep 05, 2021, syzbot wrote:
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 8419 at mm/util.c:597 kvmalloc_node+0x111/0x120 mm/util.c:597
> Modules linked in:
> CPU: 0 PID: 8419 Comm: syz-executor520 Not tainted 5.14.0-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:kvmalloc_node+0x111/0x120 mm/util.c:597

...

> Call Trace:
> kvmalloc include/linux/mm.h:806 [inline]
> kvmalloc_array include/linux/mm.h:824 [inline]
> kvcalloc include/linux/mm.h:829 [inline]
> memslot_rmap_alloc+0xf6/0x310 arch/x86/kvm/x86.c:11320
> kvm_alloc_memslot_metadata arch/x86/kvm/x86.c:11388 [inline]
> kvm_arch_prepare_memory_region+0x48d/0x610 arch/x86/kvm/x86.c:11462
> kvm_set_memslot+0xfe/0x1700 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1505
> __kvm_set_memory_region+0x761/0x10e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1668
> kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1689 [inline]
> kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1701 [inline]
> kvm_vm_ioctl+0x4c6/0x2330 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4236

KVM is tripping the WARN_ON_ONCE(size > INT_MAX) added in commit 7661809d493b
("mm: don't allow oversized kvmalloc() calls"). The allocation size is absurd and
doomed to fail in this particular configuration (syzkaller is just throwing garbage
at KVM), but for humongous virtual machines it's feasible that KVM could run afoul
of the sanity check for an otherwise legitimate allocation.

The allocation in question is for KVM's "rmap", which is used to translate a guest
pfn to a host virtual address. The rmap requires one unsigned long per 4KiB page
in a memslot, i.e. on x86-64, 8 bytes per 4096 bytes of guest memory in a memslot.
With INT_MAX=0x7fffffff, KVM will trip the WARN and fail rmap allocations for
memslots >= 1TiB, and Google already has VMs that create 1.5TiB memslots (12TiB of
total guest memory spread across 8 virtual NUMA nodes).

One caveat is that KVM's newfangled "TDP MMU" was designed specifically to avoid
the rmap allocation (among other things), precisely because of its scalability
issues. I.e. it's unlikely that KVM's so-called "legacy MMU", which relies on the
rmaps, would be used for such large VMs. However, KVM's legacy MMU is still the only
option for shadowing nested EPT/NPT, i.e. the rmap allocation would be problematic
if/when nested virtualization is enabled in large VMs.

KVM also has other allocations based on memslot size that are _not_ avoided by KVM's
TDP MMU and may eventually be problematic, though presumably not for quite some time
as it would require petabyte-sized memslots. E.g. a different metadata array requires
4 bytes per 2MiB of guest memory.

I don't have any clever ideas to handle this from the KVM side, at least not in the
short term. Long term, I think it would be doable to reduce the rmap size for large
memslots by 512x, but any change of that nature would be very invasive to KVM and
fairly risky. It also wouldn't prevent syzkaller from triggering this WARN at will.