Re: [RFC PATCH 0/4] Gang scheduling in CFS

From: Nikunj A Dadhania
Date: Mon Feb 20 2012 - 03:08:52 EST


On Thu, 5 Jan 2012 10:10:59 +0100, Ingo Molnar <mingo@xxxxxxx> wrote:
>
> * Avi Kivity <avi@xxxxxxxxxx> wrote:
>
> > > So why wait for non-running vcpus at all? That is, why not
> > > paravirt the TLB flush such that the invalidate marks the
> > > non-running VCPU's state so that on resume it will first
> > > flush its TLBs. That way you don't have to wake it up and
> > > wait for it to invalidate its TLBs.
> >
> > That's what Xen does, but it's tricky. For example
> > get_user_pages_fast() depends on the IPI to hold off page
> > freeing, if we paravirt it we have to take that into
> > consideration.
> >
> > > Or am I like totally missing the point (I am after all
> > > reading the thread backwards and I haven't yet fully paged
> > > the kernel stuff back into my brain).
> >
> > You aren't, and I bet those kernel pages are unswappable
> > anyway.
> >
> > > I guess tagging remote VCPU state like that might be
> > > somewhat tricky.. but it seems worth considering, the whole
> > > wake and wait for flush thing seems daft.
> >
> > It's nasty, but then so is paravirt. It's hard to get right,
> > and it has a tendency to cause performance regressions as
> > hardware improves.
>
> Here it would massively improve performance - without regressing
> the scheduler code massively.
>
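
For reference, the deferred-flush approach Peter describes above would
look roughly like the sketch below on the flushing side. All names here
are made up for illustration (nothing like this is in the patch at the
end of this mail), and as Avi points out, get_user_pages_fast() relying
on the flush IPI to hold off page freeing means the real thing needs
more than this:

/*
 * Illustrative sketch only: instead of waking a preempted vcpu just to
 * flush its TLB, mark it so it flushes before it runs again.  The
 * hypervisor would have to keep 'running' up to date and honour
 * 'flush_on_resume' on re-entry; the ordering against vcpu resume is
 * the hard part.
 */
struct vcpu_flush_state {
	int running;		/* maintained by the hypervisor */
	int flush_on_resume;	/* set by remote flushers */
};

static DEFINE_PER_CPU(struct vcpu_flush_state, vcpu_fs);

/* Called on the flushing CPU for each target cpu in the mask. */
static bool try_defer_flush(int cpu)
{
	struct vcpu_flush_state *s = &per_cpu(vcpu_fs, cpu);

	if (s->running)
		return false;		/* running: send the IPI as usual */

	s->flush_on_resume = 1;		/* flushed before it runs again */
	return true;			/* no IPI, no waiting */
}
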
I tried an experiment with flush_tlb_others_ipi(). It depends on
Raghu's "kvm : Paravirt-spinlock support for KVM guests" series
(https://lkml.org/lkml/2012/1/14/66), which adds a new hypercall for
kicking another vcpu out of halt.
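
The guest-visible primitive from that series is just one hypercall that
makes a halted vcpu runnable again; assuming KVM_HC_KICK_CPU as defined
there, the whole interface is:

	/* wake the vcpu with this APIC id if it executed halt() */
	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);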

Here are the results from non-PLE hardware, running the ebizzy
workload inside the VMs. The table shows the ebizzy score in
records/sec (higher is better).

8-CPU Intel Xeon, HT disabled; 64-bit VMs (8 vcpus, 1G RAM each)

+--------+------------+------------+-------------+
| VMs    |  baseline  |    gang    |  pv_flush   |
+--------+------------+------------+-------------+
| 2VM    |    3979.50 |    8818.00 |    11002.50 |
| 4VM    |    1817.50 |    6236.50 |     6196.75 |
| 8VM    |     922.12 |    4043.00 |     4001.38 |
+--------+------------+------------+-------------+

I will be posting the results for PLE hardware as well.

Here is the patch. It still needs to be hooked up with pv_mmu_ops (see
the sketch after the patch), so:

Not-yet-Signed-off-by: Nikunj A Dadhania <nikunj@xxxxxxxxxxxxxxxxxx>

Index: linux-tip-f4ab688-pv/arch/x86/mm/tlb.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/mm/tlb.c 2012-02-14 18:26:21.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/mm/tlb.c 2012-02-20 15:23:10.242576314 +0800
@@ -43,6 +43,7 @@ union smp_flush_state {
 		struct mm_struct *flush_mm;
 		unsigned long flush_va;
 		raw_spinlock_t tlbstate_lock;
+		int sender_cpu;
 		DECLARE_BITMAP(flush_cpumask, NR_CPUS);
 	};
 	char pad[INTERNODE_CACHE_BYTES];
@@ -116,6 +117,9 @@ EXPORT_SYMBOL_GPL(leave_mm);
  *
  * Interrupts are disabled.
  */
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+extern void kvm_kick_cpu(int cpu);
+#endif
 
 /*
  * FIXME: use of asmlinkage is not consistent. On x86_64 it's noop
@@ -166,6 +170,10 @@ out:
 	smp_mb__before_clear_bit();
 	cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
 	smp_mb__after_clear_bit();
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+	if (cpumask_empty(to_cpumask(f->flush_cpumask)))
+		kvm_kick_cpu(f->sender_cpu);
+#endif
 	inc_irq_stat(irq_tlb_count);
 }

@@ -184,7 +192,10 @@ static void flush_tlb_others_ipi(const s
 
 	f->flush_mm = mm;
 	f->flush_va = va;
+	f->sender_cpu = smp_processor_id();
 	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+		int loop = 1024;
+
 		/*
 		 * We have to send the IPI only to
 		 * CPUs affected.
@@ -192,8 +203,15 @@ static void flush_tlb_others_ipi(const s
 		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
 			      INVALIDATE_TLB_VECTOR_START + sender);
 
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
+			cpu_relax();
+		if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
+			halt();
+#else
 		while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
 			cpu_relax();
+#endif
 	}
 
 	f->flush_mm = NULL;
Index: linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/kernel/kvm.c 2012-02-14 18:26:55.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c 2012-02-14 18:26:55.178450933 +0800
@@ -653,16 +653,17 @@ out:
 PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
 
 /* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int apicid)
+void kvm_kick_cpu(int cpu)
 {
+	int apicid = per_cpu(x86_cpu_to_apicid, cpu);
 	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
 }
+EXPORT_SYMBOL_GPL(kvm_kick_cpu);
 
 /* Kick vcpu waiting on @lock->head to reach value @ticket */
 static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 {
 	int cpu;
-	int apicid;
 
 	add_stats(RELEASED_SLOW, 1);
 
@@ -671,8 +672,7 @@ static void kvm_unlock_kick(struct arch_
 		if (ACCESS_ONCE(w->lock) == lock &&
 		    ACCESS_ONCE(w->want) == ticket) {
 			add_stats(RELEASED_SLOW_KICKED, 1);
-			apicid = per_cpu(x86_cpu_to_apicid, cpu);
-			kvm_kick_cpu(apicid);
+			kvm_kick_cpu(cpu);
 			break;
 		}
 	}
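
As said above, the pv_mmu_ops hookup is still missing. A minimal sketch
of its shape, assuming the flush_tlb_others signature from this era's
paravirt_types.h (kvm_flush_tlb_others and kvm_setup_pv_tlb_flush are
hypothetical names, not from any posted patch):

static void kvm_flush_tlb_others(const struct cpumask *cpumask,
				 struct mm_struct *mm, unsigned long va)
{
	/*
	 * Would carry the spin-then-halt logic from the tlb.c hunk
	 * above, so that the #ifdef in common code can go away.
	 */
	native_flush_tlb_others(cpumask, mm, va);	/* placeholder */
}

static void __init kvm_setup_pv_tlb_flush(void)
{
	if (!kvm_para_available())
		return;
	pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
}

Two notes on the spin-then-halt path itself: 1024 is an arbitrary spin
budget, and there is a window between the final cpumask_empty() check
and halt() in which the last responder may issue its kick, so this only
works if a kick arriving before the halt is not lost - presumably the
same requirement Raghu's spinlock slowpath already has.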
