Re: [RFC PATCH 0/4] Gang scheduling in CFS

From: Avi Kivity
Date: Sun Dec 25 2011 - 05:58:48 EST


On 12/23/2011 12:36 PM, Ingo Molnar wrote:
> * Nikunj A Dadhania <nikunj@xxxxxxxxxxxxxxxxxx> wrote:
>
> > Here some interesting perf reports from inside the guest:
> >
> > Baseline:
> > 29.79% ebizzy [kernel.kallsyms] [k] native_flush_tlb_others
> > 18.70% ebizzy libc-2.12.so [.] __GI_memcpy
> > 7.23% ebizzy [kernel.kallsyms] [k] get_page_from_freelist
> > 5.38% ebizzy [kernel.kallsyms] [k] __do_page_fault
> > 4.50% ebizzy [kernel.kallsyms] [k] ____pagevec_lru_add
> > 3.58% ebizzy [kernel.kallsyms] [k] default_send_IPI_mask_logical
> > 3.26% ebizzy [kernel.kallsyms] [k] native_flush_tlb_single
> > 2.82% ebizzy [kernel.kallsyms] [k] handle_pte_fault
> > 2.16% ebizzy [kernel.kallsyms] [k] kunmap_atomic
> > 2.10% ebizzy [kernel.kallsyms] [k] _spin_unlock_irqrestore
> > 1.90% ebizzy [kernel.kallsyms] [k] down_read_trylock
> > 1.65% ebizzy [kernel.kallsyms] [k] __mem_cgroup_commit_charge.clone.4
> > 1.60% ebizzy [kernel.kallsyms] [k] up_read
> > 1.24% ebizzy [kernel.kallsyms] [k] __alloc_pages_nodemask
> >
> > Gang:
> > 22.53% ebizzy libc-2.12.so [.] __GI_memcpy
> > 9.73% ebizzy [kernel.kallsyms] [k] ____pagevec_lru_add
> > 8.22% ebizzy [kernel.kallsyms] [k] get_page_from_freelist
> > 7.80% ebizzy [kernel.kallsyms] [k] default_send_IPI_mask_logical
> > 7.68% ebizzy [kernel.kallsyms] [k] native_flush_tlb_others
> > 6.22% ebizzy [kernel.kallsyms] [k] __do_page_fault
> > 5.54% ebizzy [kernel.kallsyms] [k] native_flush_tlb_single
> > 4.44% ebizzy [kernel.kallsyms] [k] _spin_unlock_irqrestore
> > 2.90% ebizzy [kernel.kallsyms] [k] kunmap_atomic
> > 2.78% ebizzy [kernel.kallsyms] [k] __mem_cgroup_commit_charge.clone.4
> > 2.76% ebizzy [kernel.kallsyms] [k] handle_pte_fault
> > 2.16% ebizzy [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
> > 1.59% ebizzy [kernel.kallsyms] [k] down_read_trylock
> > 1.43% ebizzy [kernel.kallsyms] [k] up_read
> >
> > I see the main difference between both the reports is:
> > native_flush_tlb_others.
>
> So it would be important to figure out why ebizzy gets into so
> many TLB flushes and why gang scheduling makes it go away.

The second part is easy: a remote TLB flush involves sending IPIs to
many other vcpus (possibly waking them up and scheduling them), then
busy-waiting until they all acknowledge the flush. Gang scheduling is
really good here since it shortens the busy wait. It would be even
better if we also scheduled halted vcpus (see the yield_on_hlt module
parameter, with it set to 0). Directed yield on PLE should provide
intermediate results between doing nothing and gang scheduling.
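
To make the busy-wait argument concrete, here is a toy model in C. It
is only a sketch, not kernel code: shootdown_wait_us() and the delay
values are made up for illustration. The assumption is that the
initiating vcpu spins until the slowest target has acknowledged, so its
wait is the maximum per-target delay.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not kernel code: the vcpu that initiates a remote TLB
 * flush spins until every target vcpu has acknowledged the IPI, so its
 * busy-wait time is the largest per-target acknowledgement delay.
 *
 * ack_delay_us[i] is a hypothetical delay for target i: small if that
 * vcpu is currently running on a physical cpu, and up to a full host
 * scheduling latency if it is preempted or halted.
 */
unsigned long shootdown_wait_us(const unsigned long *ack_delay_us,
				size_t nr_targets)
{
	unsigned long wait = 0;
	size_t i;

	assert(ack_delay_us != NULL || nr_targets == 0);

	for (i = 0; i < nr_targets; i++)
		if (ack_delay_us[i] > wait)
			wait = ack_delay_us[i];

	return wait;
}
```

With all targets running (delays of, say, 2, 3 and 2 us) the model
waits 3 us; if a single target is preempted and takes 3000 us to come
back, the initiator spins for the whole 3000 us. That one straggler is
what gang scheduling is meant to avoid.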

The first part appears to be unrelated to ebizzy itself - it's the
kunmap_atomic() flushing ptes. It could be eliminated by switching to a
non-highmem kernel, or by allocating more PTEs for kmap_atomic() and
batching the flush.
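
The batching idea could look roughly like the toy model below. This is
only a sketch under stated assumptions, not the kernel's
kmap_atomic() code: struct kmap_pool, pool_init() and pool_map() are
hypothetical names, and the model assumes every mapping is short-lived
and already unmapped by the time its slot comes up for reuse.

```c
#include <assert.h>

/* Toy model, not the kernel's kmap_atomic() implementation. */
struct kmap_pool {
	unsigned int nr_slots;	/* PTEs reserved for temporary mappings */
	unsigned int next;	/* next slot not used since the last flush */
	unsigned long flushes;	/* number of flushes issued so far */
};

void pool_init(struct kmap_pool *p, unsigned int nr_slots)
{
	assert(nr_slots > 0);
	p->nr_slots = nr_slots;
	p->next = 0;
	p->flushes = 0;
}

/*
 * Hand out a slot. Stale slots are not flushed one by one on unmap;
 * instead a single flush is issued when the pool runs out, and all
 * slots become reusable at once.
 */
unsigned int pool_map(struct kmap_pool *p)
{
	if (p->next == p->nr_slots) {
		p->flushes++;	/* one flush covers every stale slot */
		p->next = 0;
	}
	return p->next++;
}
```

In this model, 64 map operations on a pool with a single slot cost 63
flushes, while the same 64 operations on a pool with 8 slots cost 7.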

btw, you can get an additional speedup by enabling x2apic; that should
help with the default_send_IPI_mask_logical() cost.

--
error compiling committee.c: too many arguments to function
