Re: RFC: Ideal Adaptive Spinning Conditions

From: Gregory Haskins
Date: Thu Apr 01 2010 - 08:47:01 EST


>>> On 4/1/2010 at 01:15 AM, in message <4BB42C04.7090308@xxxxxxxxxx>, Darren Hart
<dvhltc@xxxxxxxxxx> wrote:
> Steven Rostedt wrote:
>> On Wed, 2010-03-31 at 19:13 -0700, Darren Hart wrote:
>>> Steven Rostedt wrote:
>>>> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
>>>>
>>>>> o What type of lock hold times do we expect to benefit?
>>>> 0 (that's a zero) :-p
>>>>
>>>> I haven't seen your patches but you are not doing a heuristic approach,
>>>> are you? That is, do not "spin" hoping the lock will suddenly become
>>>> free. I was against that for -rt and I would be against that for futex
>>>> too.
>>> I'm not sure what you're getting at here. Adaptive spinning is indeed
>>> hoping the lock will become free while you are spinning and checking
>>> its owner...
>>
>> I'm talking about the original idea people had of "lets spin for 50us
>> and hope it is unlocked before then", which I thought was not a good
>> idea.
>>
>>
>>>>> o How much contention is a good match for adaptive spinning?
>>>>> - this is related to the number of threads to run in the test
>>>>> o How many spinners should be allowed?
>>>>>
>>>>> I can share the kernel patches if people are interested, but they are
>>>>> really early, and I'm not sure they are of much value until I better
>>>>> understand the conditions where this is expected to be useful.
>>>> Again, I don't know how you implemented your adaptive spinners, but the
>>>> trick to it in -rt was that it would only spin while the owner of the
>>>> lock was actually running. If it was not running, it would sleep. No
>>>> point waiting for a sleeping task to release its lock.
>>> It does exactly this.
>>
>> OK, that's good.
>>
>>>> Is this what you did? Because, IIRC, this only benefited spinlocks
>>>> converted to mutexes. It did not help with semaphores, because
>>>> semaphores could be held for a long time. Thus, it was good for short
>>>> held locks, but hurt performance on long held locks.
>>> Trouble is, I'm still seeing performance penalties even on the shortest
>>> critical section possible (lock();unlock();)
>>
>> performance penalties compared to what? not having adaptive at all?
>
> Right. See the data in the original mail:
>
>
> futex_lock: Result: 635 Kiter/s
> futex_lock_adaptive: Result: 542 Kiter/s
>
> So 15% fewer lock/unlock iterations per second with in-kernel adaptive
> spinning enabled for a critical section approaching 0 in length. But if
> we agree I'm taking the right approach, then it's time for me to polish
> things up a bit and send them out for review.

Hi Darren,

Part of the magic of adaptive locks is avoiding the sleep path, and part of it is using the SMP resources in what is effectively a distributed search (e.g. having N cpus actively watching for the moment they can acquire, versus going to sleep and having the scheduler wake them up later).

I haven't seen your algorithm, but if it's not simply trading the sleep path for a search for acquisition + lock-owner changes, that may be one reason for the regression you observe. For instance, if you have to do substantial work just to reach the adaptive part of the algorithm, there could be a disproportionate amount of overhead in selecting this path.
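
To make that concrete, I'd guess the contended path has a shape something like the sketch below (all names here are made up for illustration, not your actual code); anything that runs before the owner check is paid on every contended acquisition, whether or not the spin would have won:

struct lock_waiter;                                     /* hypothetical */

extern void setup_wait_state(struct lock_waiter *w);    /* hypothetical */
extern int owner_running(const struct lock_waiter *w);  /* hypothetical */
extern int spin_until_granted(struct lock_waiter *w);   /* hypothetical */
extern int sleep_until_granted(struct lock_waiter *w);  /* hypothetical */

static int lock_slowpath(struct lock_waiter *w)
{
	/* hash lookup, queue setup, etc. -- pure overhead whenever
	 * the spin would have succeeded quickly */
	setup_wait_state(w);

	if (owner_running(w))
		return spin_until_granted(w);   /* adaptive path */

	return sleep_until_granted(w);          /* classic sleep path */
}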

The whole premise is predicated on the following sequence chart:

non-adaptive:

time | threadA           | threadB
---------------------------------------------------------
     | lock(X) (granted) |
t0   |                   | lock(X) (contended)
     |                   | add_wait_list(X, B)
     | unlock(X)         |
     | grant(X, B)       |
     | wake_up(B)        |
     |                   | schedule()
     |                   |   |
     |                   |   |
     |                   |   V
t1   |                   | schedule() returns
     |                   | (returns from lock())

adaptive:

time | threadA           | threadB
---------------------------------------------------------
     | lock(X) (granted) |
t0   |                   | lock(X) (contended)
     | unlock(X)         |
     | grant(X, B)       |
     |                   | while (!is_granted(X, B) &&
     |                   |        is_running(lock_owner(X)))
t1   |                   | (returns from lock())

The idea is that the time interval t0-t1 is shorter in the adaptive case (at least most of the time), which is why the spinning doesn't end up hurting us (the cpu would otherwise be busy in the schedule() code anyway). This is the win-win scenario for adaptive, and it is generally what you get with short-hold locks (as the spinlocks that were converted to mutexes in -rt tend to be).
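
In code, thread B's side of the adaptive chart is essentially the loop below. This is just a minimal userspace sketch of the idea (C11 atomics, with a stubbed owner_is_running() standing in for the kernel's on-CPU test), not the actual futex implementation:

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

struct alock {
	atomic_long owner;              /* 0 == unlocked, else owner id */
};

/* Stub: in the kernel this would test the owner task's run state,
 * which is what the -rt adaptive mutexes key off of. */
static bool owner_is_running(long owner)
{
	(void)owner;
	return true;                    /* placeholder for illustration */
}

static void alock_acquire(struct alock *l, long self)
{
	long seen;

	for (;;) {
		seen = 0;
		/* acquisition attempt: succeeding here is the "grant" */
		if (atomic_compare_exchange_weak(&l->owner, &seen, self))
			return;         /* t1: returns from lock() */

		/* the adaptive part: spin only while the owner is
		 * actually on a cpu -- the distributed search */
		while ((seen = atomic_load(&l->owner)) != 0 &&
		       owner_is_running(seen))
			;               /* busy-wait keeps t0-t1 short */

		/* owner is asleep, so stop burning cycles. sched_yield()
		 * is a stand-in for the real sleep path (futex wait). */
		if (seen != 0)
			sched_yield();
	}
}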

For cases where the lock is held longer than the scheduling overhead, we will undoubtedly see more cpu utilization than in the non-adaptive case. However, we may _still_ see performance improvements in some scenarios, because the "grant" operation still has less overhead (thread A can skip the "wakeup", and therefore thread B still comes out of the loop with less latency than it would otherwise). In this scenario you are trading cpu cycles for reduced latency, though the performance gains (if any) are not as profound as in the ideal case.
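
Continuing the sketch above, the unlock side shows where that latency saving comes from: for a spinning waiter, clearing the owner word _is_ the grant, and the wakeup is only needed for waiters that actually slept (both helpers below are made up):

extern bool lock_has_sleepers(struct alock *l);  /* hypothetical */
extern void lock_wake_one(struct alock *l);      /* hypothetical */

static void alock_release(struct alock *l)
{
	/* a spinning thread B observes this store immediately and
	 * acquires in its next cmpxchg -- no scheduler involvement */
	atomic_store(&l->owner, 0);

	/* only waiters that went down the sleep path need the
	 * (comparatively expensive) wakeup */
	if (lock_has_sleepers(l))
		lock_wake_one(l);
}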

..and then of course there are cpu-bound workloads with long(er) held locks. They don't like adaptive-locks much at all ;)

Long story short, I would start by quantifying what your "t1-t0" delta actually is for the workload you are observing. That will give you a ballpark of whether you should expect any gains or not.
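
One way to get that number: timestamp around the acquisition call under contention and histogram the deltas, e.g. something like this (futex_lock() here is just a stand-in prototype for whatever your benchmark calls; the real signature will differ):

#include <stdint.h>
#include <time.h>

extern void futex_lock(void *lock);     /* stand-in for the call under test */

static uint64_t lock_latency_ns(void *lock)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	futex_lock(lock);               /* the t0->t1 span from the charts */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (uint64_t)((t1.tv_sec - t0.tv_sec) * 1000000000LL
			  + (t1.tv_nsec - t0.tv_nsec));
}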

-Greg
