Re: [RFC][PATCH] mips: Fix arch_spin_unlock()

From: Will Deacon
Date: Wed Feb 03 2016 - 08:32:17 EST


On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> * Will Deacon <will.deacon@xxxxxxx> wrote:
> > On Tue, Feb 02, 2016 at 10:06:36AM -0800, Linus Torvalds wrote:
> > > On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon <will.deacon@xxxxxxx> wrote:
> > > >
> > > > Given that the vast majority of weakly ordered architectures respect
> > > > address dependencies, I would expect all of them to be hurt if they
> > > > were forced to use barrier instructions instead, even those where the
> > > > microarchitecture is fairly strongly ordered in practice.
> > >
> > > I do wonder if it would be all that noticeable, though. I don't think
> > > we've really had benchmarks.
> > >
> > > For example, most of the RCU list traversal shows up on x86 - where
> > > loads are already acquires. But they show up not because of that, but
> > > because an RCU list traversal is pretty much always going to take the
> > > cache miss.
> > >
> > > So it would actually be interesting to just try it - what happens to
> > > kernel-centric benchmarks (which are already fairly rare) on arm if we
> > > change the rcu_dereference() to be a smp_load_acquire()?
> > >
> > > Because maybe nothing happens at all. I don't think we've ever tried it.
> >
> > FWIW, and this is by no means conclusive, I hacked that up quickly and ran
> > hackbench a few times on the nearest idle arm64 system. The results were
> > consistently ~4% slower using acquire for rcu_dereference.
>
> Could you please double check that? The thing is that hackbench is a _notoriously_
> unstable workload, very dependent on small details such as kernel image
> layout and random per-bootup cache/memory layout details.

Yup. FWIW, I did try across a few reboots to arrive at the 4% figure, but
it's not conclusive (as I said :).

> In fact I'd suggest to test this via a quick runtime hack like this in rcupdate.h:
>
> extern int panic_timeout;
>
> ...
>
> 	if (panic_timeout)
> 		smp_load_acquire(p);
> 	else
> 		typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p);
>
> (or so)

So the problem with this is that a LOAD <ctrl> LOAD sequence doesn't
create any order on ARM, so you're potentially at the mercy of the
branch predictor as to whether you get an acquire. That's not to say
the speculated load won't be discarded as soon as the conditional is
resolved, but it could still skew the benchmarking.

I'd be better off doing some runtime patching, but that's not something
I can knock up in a couple of minutes (so I'll add it to my list).

> and you could get specific numbers of noise estimations via:
>
> triton:~/tip> perf stat --null --repeat 10 perf bench sched messaging -l 10000
>
> [...]
>
> Performance counter stats for 'perf bench sched messaging -l 10000' (10 runs):
>
> 4.616404309 seconds time elapsed ( +- 1.67% )
>
> note that even with a repeat count of 10 runs and a loop count 100 times larger
> than the hackbench default, the intrinsic noise of this workload was still 1.6% -
> and that does not include boot-to-boot systematic noise.

That's helpful, thanks.

Will