Re: [Regression][Revert request] Excessive delay or hang during resume from system suspend due to a hrti

From: Sedat Dilek
Date: Mon Jul 16 2012 - 01:16:53 EST


Hi Linus,

Please revert:

commit 5baefd6d84163443215f4a99f6a20f054ef11236
Author: John Stultz <johnstul@xxxxxxxxxx>
Date: Tue Jul 10 18:43:25 2012 -0400

hrtimer: Update hrtimer base offsets each hrtimer_interrupt

This breaks resume on the iBook G4 and Toshiba Portege R500 (at least), by
adding an excessive delay to it (the Toshiba box sometimes hangs hard during
resume from system suspend). According to Andreas

"Apparently during or before noirq resume the system is hanging by the same
amount of time as the system was sleeping."

which seems to agree with my observations.

Given that the two known-affected boxes are so different, it is quite probable
that the total number of affected systems is actually quite high.


To everyone involved: the fact that this change, which was likely to introduce
regressions from the look of it alone, has been pushed to Linus (and to -stable
at the same time!) so late in the cycle, is seriously disappointing.


[ /QUOTE ]


When I first booted into Linux-3.5-rc7 (a few hours after its release), I got
a call trace in get_next_timer_interrupt() (NULL pointer dereference)
during early boot.
The machine froze.

I can't say whether this is related to the issue reported here, but I can
confirm that after suspend + resume the machine I am working on (a
Sandy Bridge ultrabook) is in an unusable state.
I had to do a cold reboot/restart.

- Sedat -

P.S.: Unfortunately, I could not reproduce the NULL-deref again.
Thomas gave me some instructions for enabling some debugobjects
kernel options (see attached backlog from IRC).

Backlog from #linux-rt (OFTC, German local time, UTC+2):

[09:27:26] <dileks> hi
[09:28:11] <dileks> tglx jstultz: with 3.5-rc7 I have a NULL pointer derefence in get_next_timer_interrupt
[09:28:19] <dileks> native_sched_clock
[09:28:31] <dileks> tick_nohz_stop_sched_tick.isra
[09:28:43] <dileks> tick_nohz_idle_enter
[09:28:46] <dileks> cpu_idle
[09:28:52] <dileks> start_secondary
[09:28:58] <dileks> machine freezes
[09:29:03] <tglx> brilliant
[09:29:05] <dileks> cold reboot/restart
[09:29:29] -*- dileks -> breakfast
[09:41:32] <-- trem (~trem@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) has left the network (Quit: Ex-Chat)
[09:45:27] <dileks> re
[09:46:18] <dileks> tglx: any idea?
[09:47:42] <tglx> when does this happen ?
[09:47:48] <tglx> early boot ?
[09:48:06] <dileks> yes. nothing in the logs
[10:18:26] <tglx> hmm
[10:19:21] <tglx> so it explodes in get_next_timer_interrupt(), right ?
[10:20:01] <tglx> can you enable debugobjects ?
[10:20:12] <dileks> yes, as as I saw and noted on a postit scheet
[10:20:41] <tglx> DEBUG_OBJECTS
[10:20:46] <tglx> DEBUG_OBJECTS_FREE
[10:20:51] <tglx> DEBUG_OBJECTS_TIMERS
[10:22:19] <tglx> usually explosions in get_next_timer_interrupt() are caused by timers being corrupted
[10:22:52] <tglx> debug objects usually can catch it and let the box survive plus gives us proper info about the wreckage
[10:25:37] <dileks> OK
[10:26:03] <dileks> is the build-tree exploding in size?
[10:47:48] <tglx> not much
[10:48:17] <tglx> it's only the debugobject code itself plus the timer code which grows a bit
[10:48:26] <tglx> less than 1k I think
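
For anyone wanting to follow the same debugging route, here is a minimal sketch that checks whether a kernel .config enables the three debugobjects options listed in the log. The CONFIG_ prefix is the usual .config spelling of those option names; the helper name and the default ".config" path are assumptions made for illustration only.

```python
# Options named in the IRC log, written with the CONFIG_ prefix used in .config files.
REQUIRED = (
    "CONFIG_DEBUG_OBJECTS",
    "CONFIG_DEBUG_OBJECTS_FREE",
    "CONFIG_DEBUG_OBJECTS_TIMERS",
)


def missing_debugobjects_options(config_text):
    """Return the required options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value == "y":
            enabled.add(name)
    return [opt for opt in REQUIRED if opt not in enabled]


if __name__ == "__main__":
    import sys

    # Hypothetical default path; pass the real .config location as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else ".config"
    with open(path) as fh:
        missing = missing_debugobjects_options(fh.read())
    print("missing:", ", ".join(missing) if missing else "none")
```

Run from the top of the build tree, it prints the options that still need to be switched on before rebuilding.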

-dileks // 15-Jul-2012