Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

From: Dave Chinner
Date: Mon Aug 15 2016 - 18:22:30 EST


On Sun, Aug 14, 2016 at 10:12:20PM -0700, Linus Torvalds wrote:
> On Aug 14, 2016 10:00 PM, "Dave Chinner" <david@xxxxxxxxxxxxx> wrote:
> >
> > > What does it say if you annotate that _raw_spin_unlock_irqrestore()
> function?
> > ....
> > ¿
> > ¿ Disassembly of section load0:
> > ¿
> > ¿ ffffffff81e628b0 <load0>:
> > ¿ nop
> > ¿ push %rbp
> > ¿ mov %rsp,%rbp
> > ¿ movb $0x0,(%rdi)
> > ¿ nop
> > ¿ mov %rsi,%rdi
> > ¿ push %rdi
> > ¿ popfq
> > 99.35 ¿ nop
>
> Yeah, that's a good disassembly of a non-debug spin unlock, and the symbols
> are fine, but the profile is not valid. That's an interrupt point, right
> after the popf that enables interiors again.
>
> I don't know why 'perf' isn't working on your machine, but it clearly
> isn't.
>
> Has it ever worked on that machine?

It's working the same as it's worked since I started using it many
years ago.

> What cpu is it?

Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz

> Are you running in some
> virtualized environment without performance counters, perhaps?

I've mentioned a couple of times in this thread that I'm testing
inside a VM. It's the same VM I've been running performance tests in
since early 2010. Nobody has complained that the profiles I've
posted are useless before, and not once in all that time have they
been wrong in indicating a spinning lock contention point.

i.e. In previous cases where I've measured double digit CPU usage
numbers in a spin_unlock variant, it's always been a result of
spinlock contention. And fixing the algorithmic problem that lead to
the spinlock showing up in the profile in the first place has always
substantially improved performance and scalability.

As such, I'm always going to treat a locking profile like that as
contention because even if it isn't contending *on my machine*,
that amount of work being done under a spinning lock is /way too
much/ and it *will* cause contention problems with larger machines.

> It's not actually the unlock that is expensive, and there is no contention
> on the lock (if there had been, the numbers would have been entirely
> different for the debug case, which makes locking an order of magnitude
> more expensive). All the cost of everything that happened while interrupts
> were disabled is just accounted to the instruction after they were enabled
> again.

Right, but that does not make the profile data useless, nor you
should shoot the messenger because they weren't supplied with
information you think should have been in the message. The message
still says that the majority of the overhead is in
__remove_mapping(), and it's an excessive amount of work being done
inside the tree_lock with interrupts disabled....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx