Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

From: One Thousand Gnomes
Date: Tue Jan 05 2016 - 16:43:44 EST


> It suffers the typical problems all those constructs do; namely it
> wrecks accountability.

That's "government thinking" ;-) - for most real users throughput is
more important than accountability. With the right API it ought to also
be compile-time switchable.
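Something along these lines, say (hypothetical names and CONFIG symbol,
not taken from the posted patch) - if the critical section is handed
over as a callback, a kernel that wants strict accountability can
compile the whole thing down to an ordinary spinlock:

#include <linux/spinlock.h>

#ifdef CONFIG_ALI_SPINLOCK
/* Work-merging version: the lock holder may also run other CPUs' callbacks. */
struct ali_spinlock;
void ali_spin_run(struct ali_spinlock *al, void (*fn)(void *arg), void *arg);
#else
/* Fallback: a plain spinlock, every CPU runs only its own callback. */
struct ali_spinlock {
        spinlock_t lock;
};

static inline void ali_spin_run(struct ali_spinlock *al,
                                void (*fn)(void *arg), void *arg)
{
        spin_lock(&al->lock);
        fn(arg);
        spin_unlock(&al->lock);
}
#endif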

> But here that is compounded by the fact that you inject other people's
> work into 'your' lock region, thereby bloating lock hold times. Worse,
> afaict (from a quick reading) there really isn't a bound on the amount
> of work you inject.

That should be relatively easy to fix, but with this kind of lock you
normally get the big wins from critical sections that are only a short
stretch of code. The fairness you trade away, in the cases where this is
useful, should be tiny except under extreme load, where the
"accountability first" behaviour would be to fall over in a heap.

If your "lock" involves a lot of work then it probably should be a work
queue or not using this kind of locking.

> And while its a cute collapse of an MCS lock and lockless list style
> work queue (MCS after all is a lockless list), saving a few cycles from
> the naive spinlock+llist implementation of the same thing, I really
> do not see enough justification for any of this.

I've only personally dealt with such locks in the embedded space, but
there it was a lot more than a few cycles, because you go from

        take lock
        spin
        pull things into cache
        do stuff
        cache lines go write/exclusive
        unlock

        take lock
        move all the cache lines across
        do stuff
        etc

to

        take lock
        queue work
        pull things into cache
        do work 1
        cache lines go write/exclusive
        do work 2
        unlock
        done

and for the kind of stuff you apply those locks to, that gave big
improvements. Even on crappy little embedded processors cache bouncing
hurts. Even better, work-merging locks like this tend to improve
throughput more the higher the contention gets, unlike most other lock
types.
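
For reference, the "naive spinlock+llist" scheme quoted above looks
roughly like this - a sketch only, with made-up names (ali_lock,
ali_work, ali_exec), not the code from the posted patch:

#include <linux/spinlock.h>
#include <linux/llist.h>

struct ali_work {
        struct llist_node node;
        void (*fn)(void *arg);
        void *arg;
        int done;
};

struct ali_lock {
        spinlock_t lock;                /* spin_lock_init() at setup */
        struct llist_head pending;      /* init_llist_head() at setup */
};

static void ali_drain(struct ali_lock *l)
{
        struct llist_node *list;
        struct ali_work *w, *tmp;

        /* Keep draining: new work can arrive while we run callbacks. */
        while ((list = llist_del_all(&l->pending)) != NULL) {
                llist_for_each_entry_safe(w, tmp, list, node) {
                        w->fn(w->arg);
                        smp_store_release(&w->done, 1);
                }
        }
}

static void ali_exec(struct ali_lock *l, void (*fn)(void *), void *arg)
{
        struct ali_work w = { .fn = fn, .arg = arg, .done = 0 };

        llist_add(&w.node, &l->pending);

        for (;;) {
                if (spin_trylock(&l->lock)) {
                        ali_drain(l);
                        spin_unlock(&l->lock);
                }
                /* The current holder may have run our callback for us. */
                if (smp_load_acquire(&w.done))
                        return;
                cpu_relax();
        }
}

From the caller's point of view it still behaves like a spinlock - you
spin until your callback has run - but the callback usually executes on
whichever CPU already has the protected data hot in its cache. A bound
on the injected work would go in the drain loop, e.g. cap how many
callbacks one holder runs per acquisition and push the remainder back
with llist_add_batch().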

The claim in the original post is a 3x performance improvement, but it
doesn't explain performance doing what, which kernel locks were
switched, or what patches were used. I don't find the numbers hard to
believe for a big, big box, but I'd like to see the actual use-case
patches so they can be benchmarked with other workloads and also for
latency and the like.

Alan