Re: Slow vmalloc in 2.6.35-rc3

From: Nick Piggin
Date: Thu Jun 24 2010 - 11:14:41 EST


On Thu, Jun 24, 2010 at 12:19:32PM +0300, Avi Kivity wrote:
> I see really slow vmalloc performance on 2.6.35-rc3:

Can you try this patch?
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmap-area-cache.patch


> # tracer: function_graph
> #
> # CPU DURATION FUNCTION CALLS
> # | | | | | | |
> 3) 3.581 us | vfree();
> 3) | msr_io() {
> 3) ! 523.880 us | vmalloc();
> 3) 1.702 us | vfree();
> 3) ! 529.960 us | }
> 3) | msr_io() {
> 3) ! 564.200 us | vmalloc();
> 3) 1.429 us | vfree();
> 3) ! 568.080 us | }
> 3) | msr_io() {
> 3) ! 578.560 us | vmalloc();
> 3) 1.697 us | vfree();
> 3) ! 584.791 us | }
> 3) | msr_io() {
> 3) ! 559.657 us | vmalloc();
> 3) 1.566 us | vfree();
> 3) ! 575.948 us | }
> 3) | msr_io() {
> 3) ! 536.558 us | vmalloc();
> 3) 1.553 us | vfree();
> 3) ! 542.243 us | }
> 3) | msr_io() {
> 3) ! 560.086 us | vmalloc();
> 3) 1.448 us | vfree();
> 3) ! 569.387 us | }
>
> msr_io() is from arch/x86/kvm/x86.c, allocating at most 4K (yes it
> should use kmalloc()). The memory is immediately vfree()ed. There
> are 96 entries in /proc/vmallocinfo, and the whole thing is single
> threaded so there should be no contention.

Yep, it should use kmalloc.


> Here's the perf report:
>
> 63.97% qemu [kernel] [k] rb_next
> |
> --- rb_next
> |
> |--70.75%-- alloc_vmap_area
> | __get_vm_area_node
> | __vmalloc_node
> | vmalloc
> | |
> | |--99.15%-- msr_io
> | | kvm_arch_vcpu_ioctl
> | | kvm_vcpu_ioctl
> | | vfs_ioctl
> | | do_vfs_ioctl
> | | sys_ioctl
> | | system_call
> | | __GI_ioctl
> | | |
> | | --100.00%-- 0x1dfc4a8878e71362
> | |
> | --0.85%-- __kvm_set_memory_region
> | kvm_set_memory_region
> | kvm_vm_ioctl_set_memory_region
> | kvm_vm_ioctl
> | vfs_ioctl
> | do_vfs_ioctl
> | sys_ioctl
> | system_call
> | __GI_ioctl
> |
> --29.25%-- __get_vm_area_node
> __vmalloc_node
> vmalloc
> |
> |--98.89%-- msr_io
> | kvm_arch_vcpu_ioctl
> | kvm_vcpu_ioctl
> | vfs_ioctl
> | do_vfs_ioctl
> | sys_ioctl
> | system_call
> | __GI_ioctl
> | |
> | --100.00%-- 0x1dfc4a8878e71362
>
>
> It seems completely wrong - iterating 8 levels of a binary tree
> shouldn't take half a millisecond.

It's not iterating down the tree, it's iterating through the
nodes to find a free area. It slows down because lazy vunmap
means that quite a lot of little areas build up right at our
search start address. The vmap cache should hopefully fix
it up.
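
To illustrate the effect, here is a rough userspace sketch. It is not
the kernel's code: struct area, first_fit() and fill_packed() are
made-up names, a sorted array stands in for the rbtree, and the
"cache" is assumed to work simply by remembering where the previous
search ended. It counts how many areas a first-fit scan has to step
over before it reaches a hole.

```c
#include <assert.h>
#include <stddef.h>

/* One busy range [start, end). Hypothetical, not struct vmap_area. */
struct area {
	unsigned long start;
	unsigned long end;
};

/*
 * First-fit scan over areas sorted by start address.
 * Begins at index 'from' with candidate address 'addr' and returns the
 * first address where 'size' bytes fit. '*steps' is set to the number
 * of areas that had to be stepped over.
 */
static unsigned long first_fit(const struct area *a, size_t n, size_t from,
			       unsigned long addr, unsigned long size,
			       size_t *steps)
{
	assert(from <= n);
	*steps = 0;
	for (size_t i = from; i < n; i++) {
		if (addr + size <= a[i].start)
			break;		/* the hole before this area fits */
		if (a[i].end > addr)
			addr = a[i].end;	/* skip past the busy area */
		(*steps)++;
	}
	return addr;
}

/* Pack n areas of 'len' bytes back to back starting at 'base'. */
static void fill_packed(struct area *a, size_t n, unsigned long base,
			unsigned long len)
{
	for (size_t i = 0; i < n; i++) {
		a[i].start = base + i * len;
		a[i].end = a[i].start + len;
	}
}
```

With n small areas packed at the start of the range, a scan from index
0 steps over all n of them on every allocation. A scan that resumes
from the remembered position (from = n, addr = end of the last area)
steps over none and lands on the same address.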

Thanks,
Nick
