Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lockwakeup

From: Gregory Haskins
Date: Mon Jan 10 2011 - 09:49:29 EST


Hey Steve,

Just getting back online now....

>>> On 1/3/2011 at 02:06 PM, in message
<1294081596.3948.192.camel@xxxxxxxxxxxxxxxxxxx>, Steven Rostedt
<rostedt@xxxxxxxxxxx> wrote:
> On Tue, 2010-12-28 at 07:06 -0700, Gregory Haskins wrote:
>> >>> On 12/23/2010 at 11:54 PM, in message
>
>> > Sure, but as I said, it is mostly broken anyway. I could even insert
>> > some tracepoints to show that this is always missed (heck I'll add an
>> > unlikely and do the branch profiler ;-)
>>
>> Well, I think that would be a good datapoint and is one of the things I'd
>> like to see.
>
> OK, here it is, after running a simple "dbench 10":
>
> correct incorrect  %        Function            File       Line
> ------- ---------  -        --------            ----       ----
>     150    726979 99        wakeup_next_waiter  rtmutex.c   581
>

Interesting, thanks for gathering the info.


>
> Interesting that we hit 150 times that the new owner was already awake.
> Perhaps it was because the new owner was about to spin on a spinlock
> after it had put itself into the TASK_INTERRUPTIBLE state, and it was
> woken up by someone else?

I speculate that what you are seeing is that the waiter was an adaptive spinner and beat the releaser,
flipping its own state from INTERRUPTIBLE->RUNNING in parallel before the releaser checked the condition.
It's essentially a race condition that apparently only goes that way ~1% of the time. Without further
instrumentation, though, it's just a guess.
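
If it helps to picture the window, here is a rough userspace analogue of what I mean (plain C11
atomics + pthreads; the names and the handshake are invented for illustration and are not the
rtmutex code). The "releaser" drops the lock and then peeks at the waiter's state; the "waiter"
spins and flips itself to RUNNING as soon as it notices the release. The skip-the-wakeup branch
only wins when the waiter gets there first, which is exactly the rare case:

/* releaser_vs_spinner.c -- hypothetical sketch, not kernel code.
 * Build: gcc -O2 -pthread releaser_vs_spinner.c -o releaser_vs_spinner
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { SLEEPING, RUNNING };
#define ITERS 100000

static atomic_int lock_held = 1;		/* 1 = held, 0 = released */
static atomic_int waiter_state = SLEEPING;
static atomic_int skipped, issued;

/* Waiter: adaptive spin; flip our own state to RUNNING the moment we
 * notice the release, then hand the lock back for the next round. */
static void *waiter_fn(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		while (atomic_load(&lock_held))
			;				/* spin on the owner */
		atomic_store(&waiter_state, RUNNING);	/* we woke ourselves up */
		/* ... pretend to use the lock briefly ... */
		atomic_store(&waiter_state, SLEEPING);	/* back to "blocked" */
		atomic_store(&lock_held, 1);		/* hand the lock back */
	}
	return NULL;
}

/* Releaser: drop the lock, then check whether the waiter is already
 * running.  The check races with the waiter's own state flip above. */
static void *releaser_fn(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		while (!atomic_load(&lock_held))
			;				/* wait for the hand-back */
		atomic_store(&lock_held, 0);		/* release */
		if (atomic_load(&waiter_state) == RUNNING)
			atomic_fetch_add(&skipped, 1);	/* waiter beat us: skip the wakeup */
		else
			atomic_fetch_add(&issued, 1);	/* common case: wakeup still needed */
	}
	return NULL;
}

int main(void)
{
	pthread_t w, r;
	pthread_create(&w, NULL, waiter_fn, NULL);
	pthread_create(&r, NULL, releaser_fn, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	printf("wakeup skipped %d times, issued %d times\n",
	       atomic_load(&skipped), atomic_load(&issued));
	return 0;
}

If the analogy holds, "skipped" should stay a small fraction of "issued" -- the same flavor of
split as the 150 vs. 726979 your branch profiler reported.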



>
>
>>
>> >
>> > The reason is that adaptive spinners spin in some other state than
>> > TASK_RUNNING, thus it does not help adaptive spinners at all. I first
>> > tried to fix that, but it made dbench run even slower.
>>
>> This is why I am skeptical. You are essentially asserting there are two
>> issues here, IIUC:
>>
>> 1) The intent of avoiding a wakeup is broken and we take the double whammy
>> of a mb() plus the wakeup() anyway.
>
> Yep, this is what I believe is happening.
>
>>
>> 2) mb() is apparently slower than wakeup().
>
> Forget this point, as #1 seems to be the main issue. Also, I'm not sure
> it is safe to "fix" this, as I started to try.

ok

>
>
>>
>> I agree (1) is plausible, though I would like to see the traces to confirm.
>> It's been a long time since I looked at that code, but I think the original
>> code either ran in RUNNING_MUTEX and was inadvertently broken in the
>> meantime, or the other cpu would have transitioned to RUNNING on its own
>> when we flipped the owner before the release-side check was performed. Or
>> perhaps we just plain screwed this up and it was racy ;) I'm not sure. But
>> as Peter (M) stated, it seems like a shame to walk away from the concept
>> without further investigation. I think everyone can agree that at the very
>> least, if it is in fact taking a double whammy we should fix that.
>
> [ cut out all the 2's since I'm not worried about that now, even if it
> is not a problem. ]
>
>
> Now, the way I was going to fix this is to have the adaptive wait be in
> TASK_RUNNING state, and then change the state iff we are going to sleep.
>
> But this can be a bit more racy, especially with Lai's new changes. With
> the new changes, the waiter->task does not get nulled anymore.
>
> The test to take the lock, now needs to be under the lock->wait_lock
> spinlock. We have to grab that lock, and see if the owner is null, and
> that we are the next waiter in the queue. Thus, on taking the lock we
> need to have something like this:
>
>
> 	if (adaptive_wait(&waiter, orig_owner))
> 		sleep = 1;
> 	else
> 		sleep = 0;
>
> 	if (sleep) {
> 		raw_spin_lock(&lock->wait_lock);
> 		saved_state = rt_set_current_block_state(saved_state);
> 		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
> 			sleep = 0;
> 		raw_spin_unlock(&lock->wait_lock);
> 		if (sleep)
> 			schedule_rt_mutex(lock);
> 		saved_state = rt_restore_current_blocked_state(saved_state);
> 	}
>
> Otherwise we can risk the wakeup_next_waiter() missing the wakeup.

Yep, you definitely need something like your proposal above if you want to mess with the state, I agree.
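
For anyone skimming the thread, the generic shape of the fix is the usual "prepare to sleep,
re-check the condition under the lock, only then block" ordering. A userspace analogue (pthread
mutex + condvar, with invented names -- this is not the rtmutex code; pthread_cond_wait simply
packages the set-state / re-check / block steps that the kernel spells out by hand with
TASK_UNINTERRUPTIBLE and schedule()) would look roughly like:

/* recheck_then_sleep.c -- hypothetical sketch of the ordering above, not kernel code.
 * Build: gcc -O2 -pthread recheck_then_sleep.c -o recheck_then_sleep
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wakeup    = PTHREAD_COND_INITIALIZER;
static bool lock_free;				/* analogue of "lock->owner == NULL" */

/* Waiter slow path: only block after re-checking the condition while holding
 * wait_lock, so a release that lands after we gave up spinning cannot slip in
 * unseen and leave us sleeping forever. */
static void *waiter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&wait_lock);
	while (!lock_free)			/* re-check under the lock */
		pthread_cond_wait(&wakeup, &wait_lock);
	lock_free = false;			/* we take the lock */
	pthread_mutex_unlock(&wait_lock);
	printf("waiter: took the lock, no missed wakeup\n");
	return NULL;
}

/* Releaser: publish the condition and the wakeup under the same lock; that is
 * what closes the missed-wakeup window. */
static void *releaser(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&wait_lock);
	lock_free = true;
	pthread_cond_signal(&wakeup);
	pthread_mutex_unlock(&wait_lock);
	return NULL;
}

int main(void)
{
	pthread_t w, r;
	pthread_create(&w, NULL, waiter, NULL);
	pthread_create(&r, NULL, releaser, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}

The point being: whichever way the two threads interleave, the waiter either sees lock_free
already set before it blocks, or it blocks and the signal reaches it -- there is no ordering in
which the wakeup is lost, which is the property the rtmutex slow path needs as well.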

>
> To clarify, we want the adaptive_wait() to run as TASK_RUNNING. Then if
> we must sleep, we set the state to TASK_UNINTERRUPTIBLE, test again
> whether we can still take the lock, and if not, then sleep. Otherwise, if a
> wakeup happens just before we set the state to TASK_UNINTERRUPTIBLE,
> we miss the wakeup altogether.
>
> I can do this change, and see what impact it makes.
>
> I'm also curious if this ever worked?

Well, I _thought_ so, but it was so long ago that I don't remember the unit tests at that level of granularity.

> If it did not, then are you sure your tests that showed the benefit of it
> were valid? I don't have a large-scale box at my disposal ATM, so I can
> only see what this does on 4-way machines.

Yeah, me neither. At the time we had 16- and 32-core boxes, plus your 64-core box. There were certainly
substantial improvements throughout the series on those machines (which I think you also verified
independently, or you wouldn't have accepted the patches ;). Given that what you found amounts to a race,
I suppose the code, even if it was racy from day 1, may still have had a positive impact for certain
workloads, since whether the race is hit will be environment specific.

I digress: whether it worked once and broke in the meantime, or was always broken, is purely an exercise
in evaluating my stupidity ;) I'll leave that as an exercise for the community. The important thing is to
either fix the optimization (e.g. with a patch like yours above) or remove it outright.

Bottom line: Nice find, and let me know if you need me to do anything.

-Greg

