Re: [Problem] Cache line starvation

From: Kurt Kanzenbach
Date: Thu Sep 27 2018 - 10:41:33 EST


Hi Will,

On Thu, Sep 27, 2018 at 04:25:47PM +0200, Kurt Kanzenbach wrote:
> Hi Will,
>
> On Wed, Sep 26, 2018 at 01:53:02PM +0100, Will Deacon wrote:
> > Hi all,
> >
> > On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > > cores), an i5-6400 SKL (4 cores) and on an NXP LS2044A ARM Cortex-A72 (4
> > > cores).
> > >
> > > Instrumentation always shows the same picture:
> > >
> > > CPU0                                  CPU1
> > > => do_syscall_64                      => do_syscall_64
> > > => SyS_ptrace                         => syscall_slow_exit_work
> > > => ptrace_check_attach                => ptrace_do_notify / rt_read_unlock
> > > => wait_task_inactive                 rt_spin_lock_slowunlock()
> > > -> while task_running()               __rt_mutex_unlock_common()
> > > / check_task_state()                  mark_wakeup_next_waiter()
> > > | raw_spin_lock_irq(&p->pi_lock);     raw_spin_lock(&current->pi_lock);
> > > | .                                   .
> > > | raw_spin_unlock_irq(&p->pi_lock);   .
> > > \ cpu_relax()                         .
> > > -                                     .
> > > *IRQ*                                 <lock acquired>
> > >
> > > In the error case we observe that the while() loop is repeated more than
> > > 5000 times which indicates that the pi_lock can be acquired. CPU1 on the
> > > other side does not make progress waiting for the same lock with interrupts
> > > disabled.
> > >
> > > This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
> > > the other CPU is able to acquire pi_lock and the situation relaxes.
> > >
> > > Peter suggested to do a clwb(&p->pi_lock); before the cpu_relax() in
> > > wait_task_inactive() which on both the Core2Duo and the SKL gets runtime
> > > patched to clflush(). That hides it as well.
> >
> > Given the broadcast nature of cache-flushing, I'd be pretty nervous about
> > adding it on anything other than a case-by-case basis. That doesn't sound
> > like something we'd want to maintain... It would also be interesting to know
> > whether the problem is actually before the cache (i.e. if the lock actually
> > sits in the store buffer on CPU0). Does MFENCE/DSB after the unlock() help at
> > all?
> >
> > We've previously seen something similar to this on arm64 in big/little
> > systems where the big cores can loop around and re-take a spinlock before
> > the little guys can get in the queue or take a ticket. I bodged that in
> > cpu_relax(), but there's a magic heuristic which I couldn't figure out how
> > to specify:
> >
> > https://lkml.org/lkml/2017/7/28/172
> >
> > For A72 (which is the core I think you're using) it would be interesting to
> > try both:
> >
> > (1) Removing the prfm instruction from spin_lock(), and
> > (2) Setting bit 42 of CPUACTLR_EL1 on each CPU (probably needs a
> > firmware change)
>
> correct, we use the Cortex A72.
>
> I followed your suggestions. I've removed the prefetch instructions from
> the spin lock implementation in the v4.9 kernel. In addition I've
> modified armv8/start.S in U-Boot to set bit 42 in CPUACTLR_EL1
> (S3_1_c15_c2_0). We've also made sure that this bit is actually written
> for each CPU by reading their register value in the kernel.
>
> However, the issue still triggers fine. With stress-ng we're able to
> generate latency in millisecond range. The only workaround we've found
> so far is to add a "delay" in cpu_relax().

It might be interesting for you to see how we added the delay. We used:

static inline void cpu_relax(void)
{
	volatile int i = 0;

	asm volatile("yield" ::: "memory");
	while (i++ <= 1000);
}

Of course it's not efficient, but it works.
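
To experiment with the iteration count outside the kernel, the same idea
can be sketched in user space. This is only an illustration under stated
assumptions: relax_hint(), relax_with_delay() and RELAX_MAX_SPINS are
hypothetical names, not kernel API, and the cap value is arbitrary. The
hint falls back to a plain compiler barrier on architectures other than
arm64 and x86.

```c
#include <assert.h>

/* Hypothetical upper bound so a typo cannot turn the delay into a stall. */
#define RELAX_MAX_SPINS 1000000u

/* Hypothetical user-space stand-in for the architecture's relax hint. */
static inline void relax_hint(void)
{
#if defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
	__asm__ volatile("pause" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");	/* compiler barrier only */
#endif
}

/*
 * Issue the hint, then busy-wait for 'spins' iterations. The counter is
 * volatile so the compiler cannot optimize the loop away. Returns the
 * number of iterations actually executed.
 */
static inline unsigned int relax_with_delay(unsigned int spins)
{
	volatile unsigned int i = 0;

	assert(spins <= RELAX_MAX_SPINS);

	relax_hint();
	while (i < spins)
		i++;

	return i;
}
```

Calling relax_with_delay(1000) inside a polling loop gives roughly the
same back-off as the hack above, while making the spin count a parameter
that can be tuned per platform.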

Thanks,
Kurt

>
> Any ideas, what we can test further?
>
> Thanks,
> Kurt
>
> >
> > That should prevent the lock() operation from speculatively pulling in the
> > cacheline in a unique state.
> >
> > More recent Arm CPUs have atomic instructions which, apart from CAS,
> > *should* avoid this starvation issue entirely.
> >
> > Will
> >