Re: [PATCH v2] Track hard and soft "short lockups" or "stalls."

From: ZAK Magnus
Date: Wed Jul 20 2011 - 17:15:25 EST


On Wed, Jul 20, 2011 at 2:07 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
> On Wed, Jul 20, 2011 at 12:41:39PM -0700, ZAK Magnus wrote:
>> Are the stack traces very different? I don't understand in what sense
>> it's confusing.
>
> The fact that there are 3 of them telling me the same thing.  Most people
> look at the first stack trace to figure out what is going on and will just
> notice the warning.  They might completely miss the HARDLOCKUP message on
> the third stack trace down.
>
> It just looked odd when I ran it the first time.  I feel like I would
> constantly be trying to educate people on why we do it like that.
Oh, okay. So maybe the stall warnings should say something like "a
lockup might happen soon"? Would that help? I don't know.

>> I don't think that exact patch would work (wouldn't it cause
>> update_hardstall to only ever be called with 0 as its first argument?)
>> but I hope I still understand what you're saying. You're saying stalls
>> should only be recorded once they're finished, right? I don't know if
>> this is the best approach. If we wait until interrupts stop being
>> missed, it means the code could have exited whatever section caused
>> the stall to begin with. Maybe your data indicates otherwise, but I
>> would think this means the stack trace would not really be
>
> Crap.  Good point.
Which part, exactly?

>> informative. It's one thing to know a stall occurs, but its occurrence
>> is generally reflective of a bug or a suboptimal section, so it would
>> be good to know where that is in order to try and fix it.
>>
>> For soft stalls, I think the same is true. Also, since the soft lockup
>> system just relies on checking a timestamp compared to now, it can't
>> know how long a stall was after it has already finished. The hard
>> system only knows because it keeps a running count of the number of
>> failed checks. An additional timestamp could be introduced and the
>> difference between the two retroactively checked in order to reproduce
>> this, but the stack trace issue would still apply. Also, while not
>> hugely complex, the change would be more significant than the sort
>> your patch presents.
>>
>> The bottom line is that I think catching a stall in progress is the
>> most informative thing to do, and I don't understand the downsides of
>> doing so. Could you please explain them?
>>
>> On another note, I'm working on a patch on top of this one which would
>> change the hard lockup system to be more like the soft lockup system.
>> It would use a timestamp as well, so it can have a more exact read on
>> how long the timer has been delayed. This adds resolution and gets rid
>> of that problem where it can only report missed = 3 or 4. Any
>> preliminary comments? Or should I just put the patch up before
>> discussing it?
>
> That might work.  I would have to see the patch.  What clock would you use
> to read the time?  I don't think you can use 'now' if interrupts are
> disabled.
Okay, I will send it when it seems ready. For the timestamp, I was
just using the get_timestamp function that's defined in the file,
which calls cpu_clock(). Is there a better way?
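
To make the timestamp idea a bit more concrete, here is a rough sketch in
plain C, outside the kernel. It is not the patch itself: the struct and the
function names (stall_state, stall_touch, stall_check) are made up for
illustration, and the caller passes in "now" explicitly, whereas in the
watchdog code that value would presumably come from get_timestamp() /
cpu_clock(). The point is only that a stored timestamp gives the actual
delay, rather than a count of missed checks.

```c
#include <stdint.h>

/* Hypothetical per-CPU bookkeeping; not taken from the actual patch. */
struct stall_state {
	uint64_t last_tick_ns;   /* when the timer interrupt last ran */
	uint64_t worst_stall_ns; /* longest delay observed so far */
};

/* Would be called from the timer interrupt: record that it ran. */
static void stall_touch(struct stall_state *s, uint64_t now_ns)
{
	s->last_tick_ns = now_ns;
}

/*
 * Would be called from the periodic checker. Returns how far past the
 * expected period the timer currently is, or 0 if it is on time. The
 * stall is measured while it is still in progress, so a stack trace
 * taken here would still point into the offending section.
 */
static uint64_t stall_check(struct stall_state *s, uint64_t now_ns,
			    uint64_t period_ns)
{
	uint64_t delta = now_ns - s->last_tick_ns;
	uint64_t delay;

	if (delta <= period_ns)
		return 0;

	delay = delta - period_ns;
	if (delay > s->worst_stall_ns)
		s->worst_stall_ns = delay;
	return delay;
}
```

With something like this, the checker would report a delay in nanoseconds
instead of "missed = 3 or 4", and the worst case so far is kept alongside
it. Whether the clock can be trusted in that context is a separate
question.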