Re: [Problem] Cache line starvation

From: Thomas Gleixner
Date: Wed Sep 26 2018 - 04:04:49 EST


On Wed, 26 Sep 2018, Peter Zijlstra wrote:
> On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > Instrumentation always shows the same picture:
> >
> > CPU0                                        CPU1
> > => do_syscall_64                            => do_syscall_64
> > => SyS_ptrace                               => syscall_slow_exit_work
> > => ptrace_check_attach                      => ptrace_do_notify / rt_read_unlock
> > => wait_task_inactive                          rt_spin_lock_slowunlock()
> >    -> while task_running()                     __rt_mutex_unlock_common()
> >   /   check_task_state()                       mark_wakeup_next_waiter()
> >  |    raw_spin_lock_irq(&p->pi_lock);          raw_spin_lock(&current->pi_lock);
> >  |    .                                        .
> >  |    raw_spin_unlock_irq(&p->pi_lock);        .
> >   \   cpu_relax()                              .
> >    -                                           .
> >    *IRQ*                                       <lock acquired>
> >
> > In the error case we observe that the while() loop is repeated more than
> > 5000 times, which indicates that CPU0 can acquire the pi_lock over and
> > over. CPU1, on the other hand, makes no progress while waiting for the
> > same lock with interrupts disabled.
>
> I've tried really hard to reproduce this in userspace, but so far have
> not had any luck. Looks to be a real tricky thing to make happen.

Writing a reproducer is probably as tricky as instrumenting the thing was.
I assume it's a combination of code sequences on both CPUs which involve
other (unrelated) lock instructions on the way.
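
For illustration, a minimal sketch of what a naive userspace attempt could
look like. Everything in it is an assumption made for the example: a plain
C11 test-and-set lock stands in for pi_lock, and the names waiter() and
measure_poller_rounds() are hypothetical. It only mimics the poll loop on
one thread and the single lock acquisition on the other; it contains none
of the unrelated locked instructions mentioned above, so it should not be
expected to trigger the starvation by itself.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for pi_lock: a bare test-and-set spinlock. */
static atomic_flag lock = ATOMIC_FLAG_INIT;
static atomic_int waiter_ready;
static atomic_int waiter_done;

static void spin_lock(void)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
		;
}

static void spin_unlock(void)
{
	atomic_flag_clear_explicit(&lock, memory_order_release);
}

/* Plays the CPU1 role: wants the lock exactly once. */
static void *waiter(void *arg)
{
	(void)arg;
	atomic_store(&waiter_ready, 1);
	spin_lock();
	spin_unlock();
	atomic_store(&waiter_done, 1);
	return NULL;
}

/*
 * Plays the CPU0 role: takes and drops the lock in a tight loop, like the
 * wait_task_inactive() poll loop, until the waiter has got through.
 * Returns the number of lock/unlock rounds completed in that time, or -1
 * if the waiter thread could not be created.
 */
long measure_poller_rounds(void)
{
	pthread_t t;
	long rounds = 0;

	atomic_store(&waiter_ready, 0);
	atomic_store(&waiter_done, 0);
	if (pthread_create(&t, NULL, waiter, NULL) != 0)
		return -1;

	/* Do not start counting before the waiter is actually running. */
	while (!atomic_load(&waiter_ready))
		;

	while (!atomic_load(&waiter_done)) {
		spin_lock();
		spin_unlock();
		rounds++;
	}

	pthread_join(t, NULL);
	return rounds;
}
```

Calling measure_poller_rounds() repeatedly from a main() (built with
-pthread) and looking for outliers in the returned count would be the
userspace analogue of the 5000+ loop iterations in the trace. The result
depends entirely on the machine and the scheduler.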

Thanks,

tglx