Re: [BUG] long freezes on thinkpad t60

From: Chuck Ebbert
Date: Fri Jun 15 2007 - 17:26:18 EST


On 06/14/2007 12:04 PM, Miklos Szeredi wrote:
> I've got some more info about this bug. It is gathered with
> nmi_watchdog=2 and a modified nmi_watchdog_tick(), which instead of
> calling die_nmi() just prints a line and calls show_registers().
>
> This makes the machine actually survive the NMI tracing. The attached
> traces are gathered over about an hour of stressing. An mp3 player is
> also going on continually, and I can hear a couple of seconds of
> "looping" quite often, but it gets as far as the NMI trace only
> rarely. AFAICS only the last pair shows a trace for both CPUs during
> the same "freeze".
>
> I've put some effort into understanding what's going on, but I'm not
> familiar with how interrupts work and that sort of thing.
>
> The pattern that emerges is that on CPU0 we have an interrupt, which
> is trying to acquire the rq lock, but can't.
>
> On CPU1 we have strace which is doing wait_task_inactive(), which sort
> of spins acquiring and releasing the rq lock. I've checked some of
> the traces and it is just before acquiring the rq lock, or just after
> releasing it, but is not actually holding it.
>
> So is it possible that wait_task_inactive() could be starving the
> other waiters of the rq spinlock? Any ideas?

Spinlocks aren't fair, so this kind of problem is always a possibility.
I think maybe we need another kind of unlock that gives another processor
a fair chance at the lock. Some things you could try to see if they help:

- add smp_mb() after the unlock
- replace cpu_relax() with usleep()
- use an xchg instruction to do the unlock, like i386 does when
CONFIG_X86_OOSTORE is set
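
To illustrate what "a fair chance at the lock" could look like, here is a
minimal user-space sketch of a ticket lock, where waiters are served in
arrival order so a thread that releases and immediately re-acquires cannot
jump ahead of one already waiting. This is an assumption-laden illustration
only: it is not the kernel's spinlock code, it uses C11 atomics and pthreads
rather than kernel primitives, and every name in it (ticket_lock,
ticket_lock_acquire, ticket_lock_demo, and so on) is hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space ticket lock; NOT the kernel's spinlock_t. */
struct ticket_lock {
    atomic_uint next;   /* next ticket number to hand out */
    atomic_uint owner;  /* ticket number currently allowed to hold the lock */
};

static void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a ticket; arrival order decides who gets the lock next. */
    unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);

    /* Wait until our ticket is served. Kernel code would spin with
     * cpu_relax(); this user-space sketch yields the CPU instead. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        sched_yield();
}

static void ticket_lock_release(struct ticket_lock *l)
{
    /* Only the holder writes 'owner', so a relaxed read is enough here. */
    unsigned int cur = atomic_load_explicit(&l->owner, memory_order_relaxed);

    /* Hand the lock to the next ticket in line. */
    atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}

#define DEMO_THREADS 2
#define DEMO_ITERS   50000

static struct ticket_lock demo_lock;
static long demo_counter;   /* protected by demo_lock */

static void *demo_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < DEMO_ITERS; i++) {
        ticket_lock_acquire(&demo_lock);
        demo_counter++;
        ticket_lock_release(&demo_lock);
    }
    return NULL;
}

/* Runs DEMO_THREADS workers that hammer the lock; returns the final count. */
long ticket_lock_demo(void)
{
    pthread_t threads[DEMO_THREADS];

    ticket_lock_init(&demo_lock);
    demo_counter = 0;

    for (int i = 0; i < DEMO_THREADS; i++)
        pthread_create(&threads[i], NULL, demo_worker, NULL);
    for (int i = 0; i < DEMO_THREADS; i++)
        pthread_join(threads[i], NULL);

    return demo_counter;
}
```

With a plain test-and-set lock, the thread that just released tends to win
the race to re-acquire because the lock's cache line is still local to it.
In the sketch above it has to take a new ticket, which puts it behind any
thread that is already waiting.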
